# Variational Autoencoders 1: Overview

In a previous post, we briefly mentioned some recent approaches for Generative Modeling. Among those, RBMs and DBMs are probably very tricky because the estimation of gradients in those models is based on a good mixing of MCMC, which tends to get worse during the course of training because the model distribution gets sharper. Autogressive models like PixelRNN, WaveNet, etc… are easier to train but have no latent variables, which makes them somewhat less powerful. Therefore, the current frontier in Generative Modelling is probably GANs and Variational Autoencoders (VAEs).

While GANs are too mainstream, I thought I can probably write a post or two about Variational Autoencoders, at least to clear up some confusions I am having about them.

Formally, generative modeling is the area in Machine Learning that deals with models of distributions $P(X)$, defined over datapoints $X$ in some high-dimensional space $\mathcal{X}$. The whole idea is to construct models of $P(X)$ that assigns high probabilities to data points similar to those in the training set, and low probabilities every where else. For example, a generative models of images of cows should assign small probabilities to images of human.

However, computing the probability of a given example is not the most exciting thing about generative models. More often, we want to use the model to generate new samples that look like those in the training set. This “creativity” is something unique to generative models, and does not exist in, for instance, discriminative models. More formally, say we have a training set sampled from an unknown distribution $P_\text{org}(X)$, and we want to train a model $P$ which we can take sample from, such that $P$ is as close as possible to $P_\text{org}$.

Needless to say, this is a difficult problem. To make it tractable, traditional approaches in Machine Learning often have to 1) make strong assumptions about the structure of the data, or 2) make severe approximation, leading to suboptimal models, or 3) rely on expensive sampling procedures like MCMC. Those are all limitations which make Generative modeling a long-standing problem in ML research.

Without further ado, let’s get to the point. When $\mathcal{X}$ is a high-dimensional space, modeling is difficult mostly because it is tricky to handle the inter-dependencies between dimensions. For instance, if the left half of an image is a horse then probably the right half is likely another horse.

To further reduce this complexity, we add a latent variable $z$ in a high-dimensional space $\mathcal{Z}$ that we can easily sample from, according to a distribution $P(z)$ defined over $\mathcal{Z}$. Then say we have a family of deterministic function $f(z;\theta)$ parameterized by a vector $\theta$ in some space $\Theta$ where $f: \mathcal{Z} \times \Theta \rightarrow \mathcal{X}$. Now $f$ is deterministic, but since $z$ is a random variable, $f(z;\theta)$ is a random variable in $\mathcal{X}$.

During inference, we will sample $z$ from $P(z)$, and then train $\theta$ such that $f(z;\theta)$ is close to samples in the training set. Mathematically, we want to maximize the following probability for every sample $X$ in the training set:

$\displaystyle P(X) = \int P\left(X\vert z; \theta\right)P(z)dz$   (1)

This is the good old maximum likelihood framework, but we replace $f(z;\theta)$ by $P\left(X\vert z;\theta\right)$ (called the output distribution) to explicitly indicate that $X$ depends on $z$, so that we can use the integral to make it a proper probability distribution.

There are a few things to note here:

• In VAEs, the choice of the output distribution is often Gaussian, i.e. $P\left(X\vert z;\theta\right) = \mathcal{N}\left(X; f(z;\theta), \sigma^2 * I\right)$, meaning it is a Gaussian distribution with mean $f(z;\theta)$ and diagonal covariance matrix where $\sigma$ is a scalar hyper-parameter. This particular choice has some important motivations:
• We need the output distribution to be continuous, so that we can use gradient descent on the whole model. It wouldn’t be possible if we use discontinuous function like the Dirac distribution, meaning to use exactly the output value of $f(z;\theta)$ for $X$.
• We don’t really need to train our model such that $f(z;\theta)$ produces exactly some sample X in the training set. Instead, we want it to produce samples that are merely like X. In the beginning of training, there is no way for $f$ to gives exact samples in the training set. Hence by using a Gaussian, we allow the model to gradually (and gently) learn to produce samples that are more and more like those in the training set.
• It doesn’t have to be Gaussian though. For instance, if $X$ is binary, we can make $P\left(X\vert z;\theta\right)$ a Bernoulli parameterized by $f(z;\theta)$. The important property is that $P\left(X\vert z\right)$ can be computed, and is continuous in the domain of $\theta$.
• The distribution of $z$ is simply the normal distribution, i.e. $P(z) = \mathcal{N}\left(0,I\right)$. Why? How is it possible? Is there any limitation with this? A related question is why don’t we have several levels of latent variables. which potentially might help modelling complicated processes?
All those question can be answered by the key observation that any distribution in $d$ dimensions can be generated by taking $d$ variables from the normal distribution and mapping them through a sufficiently complicated function.
Let that sink for a moment. Readers who are interested in the mathematical details can have a look at the conditional distribution method described in this paper. You can also convince yourself if you remember how we can sample from any Gaussian as described in an earlier post.
Now, this observation means we don’t need to go to more than one level of latent variable, with a condition that we need a sufficiently complicated function for $f(z;\theta)$. Since deep neural nets has been shown to be a powerful function approximator, it makes a lot of sense to use deep neural nets for modeling $f$.
• Now the only business is to maximize (1). Using the law of large numbers, we can approximate the integral by the expected value over a large number of samples. So the plan will be to take a very large sample $\left\{z_1, ..., z_n\right\}$ from $P(z)$, then compute $P(X) \approx \frac{1}{n}\sum_i P\left(X\vert z_i;\theta\right)$. Unfortunately the plan is infeasible because in high dimensional spaces, $n$ needs to be very large in order to have a good enough approximation of $P(X)$ (imagine how much samples you would need for $200 \times 200 x 3$ images, which is in 120K dimensional space?)
Now the key to realize is that we don’t need to sample $z$ from all over $P(z)$. In fact, we only need to sample $z$ such that $f(z;\theta)$ is more likely to be similar to samples in the training set. Moreover, it is likely that for most of $z$, $P(X\vert z)$ is nearly zero, and therefore contribute very little into the estimation of $P(X)$. So the question is: is there any way to sample $z$ such that it is likely to generate $X$, and only estimate $P(X)$ from those?
It is the key idea behind VAEs.

That’s quite enough for an overview. Next time we will do some maths and see how we go about maximizing (1). Hopefully I can then convince you that VAEs, GANs and GSNs are really not far away from each other, at least in their core ideas.

So I read the Learning to learn paper a while ago, and I was surprised that the Decoupled Neural Interfaces paper didn’t cite them. For me the ideas are pretty close, where you try to predict the gradient used in each step of gradient descent, instead of computing it by backpropagation. Taking into account that they are all from DeepMind, won’t it be nice to cite each other and increase the impact factors for both of them?

Nevertheless, I enjoyed the paper. The key idea is instead of doing a normal update $\theta_{t+1} = \theta_{t} - \alpha_t \nabla f\left(\theta_t\right)$, we do it as $\theta_{t+1} = \theta_{t} + g_t\left(\nabla f\left(\theta_t\right), \phi\right)$ where $g_t$ is some function parameterized by $\phi$.

Now one can use any function approximator for $g_t$ (called optimizer, to differentiate with $f\left(\theta\right)$ – the optimizee), but using RNNs has a particular interesting intuition as we hope that the RNNs can remember the gradient history and mimic the behavior of, for instance, momentum.

The convenient thing about this framework is that the objective function for training the optimizer is the expected weighted sum of the output of the optimizee $f\left(\theta\right)$. Apart from this main idea, everything else is nuts and bolts, which of course are equivalently important.

The first obstacle that they had to solve is how to deal with big models of perhaps millions of parameters. In such cases, $g_t$ has to input and output vector of millions of dimensions. Instead, the authors solved this problem very nicely by only working with one parameter at a time, i.e. the optimizer only takes as input one element of the gradient vector and output the update for that element. However, since the optimizer is a LSTM, the state of the gradient coordinates are maintained separately. This also has a nice side effect that  it reduces the size of the optimizer, and you can potentially re-use the optimizer for different optimizees.

The next two nuts and bolts are not so obvious. To mimic the L2 gradient clipping trick, they used the so-called global averaging cell (GAC), where the outgoing activations of LSTM cells are averaged at each step across all coordinates. To mimic Hessian-based optimization algorithms, they wire the LSTM optimizer with an external memory unit, hoping that the optimizer will learn to store the second-order derivatives in the memory.

Although the experimental results look pretty promising, many people pose some doubts about the whole idea of learning to learn. I was in the panel discussion of Learning to learn at NIPS, and it wasn’t particularly fruitful (people were drinking sangria all the time). It will be interesting to see the follow-ups on this line of work, if there is any.

# From self-driving cars to model and dataset sizes

So I am done with teaching a vehicle to drive itself!

Errh, not quite there yet. I did it on a simulator, in an easy environment where there is only one lane, and no other traffic. This is very far from an actual self-driving vehicle.

Nevertheless, I had a lot of fun. It was actually way easier than I initially thought. It is simply a regression problem, where a CNN was trained to predict the steering angle. A vanila CNN with a significant amount of training data would do the job quite easily. Although it sounds simple, eventually this is how nVidia drives a car with their DAVE-2 system.

In practice, self-driving car is a bit more complicated. For example, nVidia’s paper didn’t show how they would handle traffic lights. I guess the Deep Learning way for that would be to collect a lot more data at crossroads, but I feel that would not be enough. At some point, you will need traditional engineering methods like sensor fusion to precisely locate the car on the road (more precise than what GPS provides), path finding for planning and all kinds of other signals.

However, every time I apply Deep Learning to a new domain, I learned something new. For this project, it is the following:

• On the vehicle, there are 3 cameras: one in the middle, one on the left and one on the right. Normally you just need to train the CNN to map the image collected from the center camera to the steering angle, and be done with it. However, it turns out that you can use the side cameras to teach the vehicle to recover from mistakes. For example, if the car is taking a left turn, then you can use the image from the left camera to teach it to do a softer left turn, and the image from the right camera do a harder left turn. Using this approach, during inference, you only need to run inference on the center image. How much softer and harder should be empirically determined.
You might think that you can read 3 images in the same time, and feed all three into the network, but that will require 3 images during inference, which might slow down the inference.
In fact the above technique is used by nVidia in their paper, and it could help the vehicle to recover from mistake, for example when it is close to the edge of the road.
Another data augmentation technique is to vertically flip the images, and reverse the steering angle. Using both techniques, you can augment the training set by a factor of 6.
• Inference time is crucial. In the beginning, I struggled a lot making the model to work. Then at some point I realize that it took around 0.1 second to evaluate the model, which might be too slow to drive a car. I then reduce the size of the model, until the point where it takes 0.01 seconds to evaluate, then the vehicle starts driving smoothly.

So how small (or big) your model should be? This obviously depends on the training set,  but is there any rule of thumb? A related question that some people also asked me is how big the training set should be? We keep saying Deep Learning needs big datasets, but how big is big, or how big should it be to expect some sensible performance? I hope the rest of this post could answer those questions.

How big the model should be?

Let’s say you have a training set of N samples. Now if I use a simple array of bits to store those samples, then I would need N bits to store N samples (the first bit is ON given the first sample, and so on). More strictly, I could say I only need $\log_2\left(N\right)$ bits to store N samples, because I could have N different configurations with that many bits.

In Deep Learning, we are well graduated from speaking in bits, but the same principle still holds. The easiest answer is you will need to construct your model so that it has N parameters to learn a training set of N samples.

That is still too lax though. Recall that a parameter in a neural net is a 32-bit floating point number, so a model of N parameters will have $32N$ bits in total. That’s why you would only need a model of $\frac{N}{32}$ parameters?

Not that strict. Although the parameters in neural nets are floating points, their values are often small, typically in the range of -0.3 to 0.3 (depending on how you normalize the data). This is due to various tricks we apply to the nets like initialization and small learning rate, in order to make optimization easier.

Since their values are restricted, probably only a few bits in each parameters are carrying useful information. How many is that? Typically people think it is about 8 or 16 bits. The proof for that is when you quantize the nets to low-precision (of 8 or 16 bits), then the performance of the net doesn’t decrease much.

So, as a typical (wild) rule of thumb, you should be able to overfit a training set of size N with a model of $\frac{N}{4}$ parameters. If you cannot overfit the training set, you are doing something really wrong with your initialization, learning rate and regularizer.

So you need to know how to count the number of parameters in a deep net. For fully connected layers, that simply is the size of the weight matrix and the biases. For convolutional layers, it is the size of the filter, multiplied by the number of filters. Most  modern Deep learning framework doesn’t use biases for convolutional layer, but in the past, people used to use a bias for each filter, so keep in mind that if you want to be very precise. The vanila RNN can be computed similarly.

LSTM is a bit more tricky, because there are a few variants of those: whether peephole is enabled, whether the forget bias is fixed, is it multi-dimensional LSTM, etc.. so the exact number might vary. However in general, the number of parameters of an LSTM layers of p units with inputs should be in the order of $5pq$.

Some time ago I used to write a python script to compute the exact number of parameters in a MDLSTM cell, but looking at it now took me some time to understand it.

I hope this points out that the key advantage of Deep Learning, compared to traditional method, is we can engineer the model as big as we want, sometimes depending on the dataset. This is not easily doable with other models like SVM and the like.

How big is the training set?

Using a similar reasoning, you could also answer this pretty easily.

Assume that your input is a N-dimensional vector, then the maximum number of configuration in that space is $2^{32N}$, which is enormous (sorry for using the word, you have Donald Trump to blame).

Of course that is the number of distinct configuration for all possible input. Your input domain is likely going to be a manifold in that high-dimensional space, meaning it will probably only take a tenth of that many degrees of freedom. So let’s say $2^{N/10}$.

Now you don’t need every sample in your input domain to train a deep model. As long as your input domain is relatively smooth, and the training set covers the most important modes in the data distribution, the model should be able to figure out the missing regions. So again, probably you only need a fifth of those, meaning around $2^{N/50}$ samples.

For instance in MNIST, the input is of $28 * 28 = 784$ dimensions, then you should have around $2^{784/50} \approx 32000$ samples. In fact there are 50000 samples in the MNIST training set.

In general, I think the rule of thumb would be around tens of thousands samples for a typical problem so that you can expect some optimistic results.

Note that those calculations are very coarse, and should only be used to give some intuition. They shouldn’t be used as an exact calculation as-it-is.

The problem is worse with time series and sequential data in general. Using the same calculation, you would end up with pretty big numbers because you need to multiply the numbers by the length of the sequence. I don’t think the same calculation can be applied for sequential data, because in sequences, the correlation between consecutive elements also play a big role in learning, so that might lax or limit the degree of freedom of the data. However, I used to work with small sequence dataset of size around tens of thousands samples. For difficult datasets, we might need half a million of samples.

The more you work on modelling, the more you learn about it. As always, I would love to hear your experience!

# What Goes Around Comes Around: Databases, Big Data and the like

Over the years, I have the privilege of working with some pretty damn good people. One of those guys is a PhD in Database Research, used to be a professor, but apparently so passionate and so good at teaching that he eventually quits academia to join the industry.

He did an PhD in XML database, and even though XML database is completely useless, it doesn’t mean a PhD in XML Database couldn’t teach you anything (in fact, a good PhD could teach you quite many useful things). One interesting thing I learned from him was the evolution of database technology, which originates from an essay by Michael Stonebraker called What goes around comes around.

Michael Stonebraker is a big name in Database Research and has been around in Database research for a good 35 years, so you could easily expect him to be highly opinionated on a variety of things. The first few lines in the essay read like this:

In addition, we present the lessons learned from the exploration of the proposals in each era. Most current researchers were not around for many of the previous eras, and have limited (if any) understanding of what was previously learned. There is an old adage that he who does not understand history is condemned to repeat it. By presenting “ancient history”, we hope to allow future researchers to avoid replaying history.

Unfortunately, the main proposal in the current XML era bears a striking resemblance to the CODASYL proposal from the early 1970’s, which failed because of its complexity. Hence, the current era is replaying history, and “what goes around comes around”. Hopefully the next era will be smarter.

His work, among others, include PostgreSQL. Hence, after reading the essay, it becomes obvious to me why there are so many highly opinionated design decisions being implemented in Postgres.

The essay is a very good read. You get to see how database technologies evolved from naive data models to unnecessarily complicated models, then thanks to a good mathematician named Edgar F. Codd, it got way more beautiful and highly theoretically-grounded. After a few alternatives, a new wave of XML databases come back bearing a lot of complications. (Along the way, you also get to see how Michael Stonebraker managed to sell several startups, but that isn’t the main story – or is it?)

There are many interesting lesson learned. The most two interesting for me are:

• XML database doesn’t take off because it is exceedingly complicated, and there is no way to efficiently store and index it using our best data structures like B-trees and the like.
I learned XML databases and I was told that XML databases failed because it lacks a theoretical foundation like the Relational model. Now looking back, I think that isn’t the main issue. The problem with XML is that it allows too much flexibility in the language, so implementing a good query optimizer for it is extremely difficult.
A bit more ironically, this is how Michael Stonebraker puts it:

We close with one final cynical note. A couple of years ago OLE-DB was being pushed hard by Microsoft; now it is “X stuff”. OLE-DB was pushed by Microsoft, in large part, because it did not control ODBC and perceived a competitive advantage in OLE-DB. Now Microsoft perceives a big threat from Java and its various cross platform extensions, such as J2EE. Hence, it is pushing hard on the XML and Soap front to try to blunt the success of Java

It sounds very reasonable to me. At some point around 2000-2010, I remember hearing things like XML is everywhere in Windows. It has to be someone like Microsoft keeps pushing it hard to make it quite phenomenal. When Microsoft started the .NET effort to directly compete with Java, the XML database movement faded away.
One thing Michael Stonebraker got it wrong though. In the essay, he said XML (and SOAP) is gonna be the data exchange format of the future, but it turns out XML is still overly complicated for this purpose, and people ended up with JSON and RESTful instead.

• The whole competitive advantage of PostgreSQL was about UDTs and UDFs, a somewhat generalization of stored procedures. Although stored procedures are soon out-of-fashion because people realize it is difficult to maintain your code in multiple places, both in application code and store procedures in DBMS. However, the idea of bringing code close to data (instead of bringing data to code) is a good one, and has a big consequence on the Big Data movement.

Speaking of Big Data, Stonebraker must have something to say about it. For anyone who is in Big Data, you should really see this if you haven’t:

The talk presents a highly organized view on various aspects of Big Data and how people solved them (and of course mentions a few startups founded by our very Michael Stonebraker).

He mentioned Spark at some point. If you look at Spark nowadays, it’s nothing more than an in-memory distributed SQL engine (for traditional business intelligence queries), along with a pretty good Machine Learning library (for advanced analytics). From a database point of view, Spark looks like a toy: you can’t share tables, tables don’t have indices, etc… but the key idea is still there: you bring computation to the data.

Of course I don’t think Spark wants to become a database eventually, so I am not sure if Spark plans to fix those issues at all, but adding catalog (i.e. data schema), and supporting a somewhat full-fledged SQL engine were pretty wise decisions.

There are several other good insights about the Big Data ecosystem as well: why MapReduce sucks, what are other approaches to solve the Big Volume problem (besides Spark), how to solve the Big Velocity problem with streaming, SQL, NoSQL and NewSQL, why the hardest problem in Big Data is Variety, etc…  I should’ve written a better summary of those, but you could simply check it out.

# Random thoughts on topological sort, backpropagation and stuff you learn in school

I learned the topological sort algorithm about 10 years ago in my undergrad, and for me it was pretty bizarre. I remember it was in one of the Introduction to Algorithms modules, where we were dealing with algorithms for graphs, like Binary search trees, RB trees and so on. Then suddenly, out of nowhere, we got to learn topological sort – an algorithm for sorting partially ordered sets.

Erhh.. what sets were that?

I mean I got RB tree because it is beautiful, and even though you probably won’t need to directly implement it at all in your whole career, learning the thing will improve your algorithmic mindset. But giving an order to a set of things which might not have a well-defined total order? Sorry but how useful it is gonna be?

Turns out the algorithm is neither fancy nor beautiful. Like most of other algorithms in textbooks, it is beautiful to some degree, but in a module where they taught you algorithms, it was difficult to pay any particular attention to topological sort alone.

Now in retrospect, I think topo sort is the single algorithm from those modules that I have been using most often. Okay, maybe I have been using quick sort more often, but I don’t need to re-implement it (except in interviews, which don’t count).

Imagine you want to run backpropagation in a neural network of multiple layers. It would be easy if your network is a simple chain of layers, but if you allow a layer to have multiple inputs (something like multi-modal nets), then how are you going to do it in a generic way?

Another example is when you have a pipeline of stuff, let it be a business workflow, a data flow, etc… where every operation might depend on other operations. The generic way to figure out an order to execute those operations is via topological sort.

If you have been using Tensorflow, then yes, they implement it for the graphs. In tensorflow, there are many different kinds of connection, an Op can be input to other Ops, or its execution simply needs to be happens before some other Ops (when some variables need to be initialized, for instance), but the general idea doesn’t change.

If there is any textbook algorithm that I am re-implementing again and again, I believe it is topological sort, and I couldn’t be more thankful to the professors who taught me this thing (fundamentally).

In fact, I realize it’s super difficult to fully anticipate the beauty and potential applications of stuff they taught you when you were in school. I am pretty sure most of them are useless (like I got to learn XML databases and object oriented databases somewhere in my education), but some will turn out to be unexpectedly valuable at some point in your career. This applies to many other things you learn outside of school as well. The trick is to know what to learn to maximize the profit in the long run.

Unfortunately I don’t know the answer to that. My strategy so far is to learn as much as I could, from as many people as I could. But we will see how far that will work (although I think it wouldn’t work for too long).

Before you ask, the next most popular textbook algorithm I have been using is state machines. When I was implementing my toy compiler around a decade ago, I didn’t imagine I would need to do it again and again for various kinds of parsers I had been doing.

And of course, Viterbi, I had some good time playing around with various aspects of it.

# Secured private Docker registry in Kubernetes

If you run a Docker-based Kubernetes cluster yourself, sooner or later you will find out that you need a Docker registry to store the docker images. You might start out with a public registry out there, but often you might want to keep your images away from the public. Now if your cluster is on the cloud, you can just use the Container Registry provided by AWS EC2 or Google Cloud Platform. If your cluster is on-prem however, then you might want to keep the registry close to your cluster, hence deploying your own registry might be a good idea.

For starters, you can always use the registry addon shipped with Kubernetes. The default setup will give you an unsecured registry, so you will need to setup a DeamonSet to route a local port to the registry, so that to the workers, your registry runs on localhost:PORT, which will not trigger the secured logic of the docker daemon. Check the link for more information.

This setup is rather bad though. If a user, from his machine, wants to push his image to the registry, then he has to use kubectl to setup a proxy to the registry service, so that the service is available on his machine at localhost:PORT. This is rather inconvenient and tedious. We need a registry available at a separated host name, so that it can receive images from any machines in the network, and serve images to any workers in the Kubernetes cluster.

# Yes, you should tune your pesky learning rate

For a ConvNN I trained recently, this is the learning curves when using Adam optimizer with initial learning rate = 0.01:

When using the traditional SGD with initial learning rate = 0.01, momentum = 0.9 and decaying learning rate every 3 epochs with decay rate of 0.96, the learning curves become:

I hope you see the drastic difference. With momentum, we got 10% error rate after 5 epochs. While with Adam, we got ~30% error rate after 20 epochs.

Now it might happen that Adam will work better if we add fancy stuff like Batch Norm and the likes to the network, which I didn’t try. However, when everything else being equal, it feels to me that Adam was a bit aggressive in decreasing the learning rate, which makes learning progress slow after a while.

Since the learning rate is more strongly regulated in Adam, perhaps we can be more lax in setting the initial learning rate? This is the learning curves for Adam with initial learning rate = 0.1

It went wild eventually.

But with initial learning rate = 0.001, Adam gives this:

It is much better now.

Over the years, momentum and decaying learning rate has been my first choice for tuning the learning rate. I sometimes use Adagrad/RMSProp/Adam for cross-checking, but the best results are usually found with momentum, often with less training epochs.

The take-away message is you should really tune your learning rate hard. It is still one of the most important hyper-parameters. Although methods like Adam/Adagrad might make the impression that tuning the learning rate is easy, in fact it is very problem-dependent. Momentum has many more knobs to tune, but once used wisely, it will be very flexible and powerful. Often you will end up to the same ballpark with any of those optimizers.