# Kalman filters (and how they relate to HMMs)

Kalman filters are insanely popular in many engineering fields, especially those involve sensors and motion tracking. Consider how to design a radar system to track military aircrafts (or warships, submarines, … for that matter), how to track people or vehicles in a video stream, how to predict location of a vehicle carrying a GPS sensor… In all these cases, some (advanced) variation of Kalman filter is probably what you would need.

Learning and teaching Kalman filters is therefore quite challenging, not only because of the mere complexity of the algorithms, but also because there are many variations of them.

With a Computer Science background, I encountered Kalman filters several years ago, but never managed to put them into the global context of the field. I had chances to look at them again recently, and rediscovered yet another way to present and explain Kalman filters. It made a lot of sense to me, and hopefully it does to you too.

Note that there are a lot of details missing from this post (if you are building a radar system to track military aircrafts, look somewhere else!). I was just unhappy to see many introductory material on Kalman filters are either too engineering or too simplified. I want something more Machine Learning-friendly, so this is my attempt.

Let’s say you want to track an enemy’s aircraft. All you have is a lame radar (bought from Russia probably) which, when oriented properly, will give you a 3-tuple of range, angle and angular velocity $[r \;\phi\;\dot{\phi}]^{T}$ of the aircraft. This vector is called the observation $\mathbf{z}_k$ (subscript $k$ because it depends on time). The actual position of the aircraft, though, is a vector in cartesian coordinates $\mathbf{x}_k = [x_1\;x_2\;x_3]^{T}$. Since it is an enemy’s aircraft, you can only observe $\mathbf{z}_k$, and you want to track the state vector $\mathbf{x}_k$ over time, every time you receive a measurement $\mathbf{z}_k$ from the radar.

Visualised as a Bayesian network, it looks like this:

With all the Markov properties hold, i.e. $\mathbf{x}_k$ only depends on $\mathbf{x}_{k-1}$ and $\mathbf{z}_k$ only depends on $\mathbf{x}_k$, does this look familiar?

# Variational Autoencoders 3: Training, Inference and comparison with other models

Variational Autoencoders 1: Overview
Variational Autoencoders 2: Maths
Variational Autoencoders 3: Training, Inference and comparison with other models

Recalling that the backbone of VAEs is the following equation:

$\log P\left(X\right) - \mathcal{D}\left[Q\left(z\vert X\right)\vert\vert P\left(z\vert X\right)\right] = E_{z\sim Q}\left[\log P\left(X\vert z\right)\right] - \mathcal{D}\left[Q\left(z\vert X\right) \vert\vert P\left(z\right)\right]$

In order to use gradient descent for the right hand side, we need a tractable way to compute it:

• The first part $E_{z\sim Q}\left[\log P\left(X\vert z\right)\right]$ is tricky, because that requires passing multiple samples of $z$ through $f$ in order to have a good approximation for the expectation (and this is expensive). However, we can just take one sample of $z$, then pass it through $f$ and use it as an estimation for $E_{z\sim Q}\left[\log P\left(X\vert z\right)\right]$ . Eventually we are doing stochastic gradient descent over different sample $X$ in the training set anyway.
• The second part $\mathcal{D}\left[Q\left(z\vert X\right) \vert\vert P\left(z\right)\right]$ is even more tricky. By design, we fix $P\left(z\right)$ to be the standard normal distribution $\mathcal{N}\left(0,I\right)$ (read part 1 to know why). Therefore, we need a way to parameterize $Q\left(z\vert X\right)$ so that the KL divergence is tractable.

Here comes perhaps the most important approximation of VAEs. Since $P\left(z\right)$ is standard Gaussian, it is convenient to have $Q\left(z\vert X\right)$ also Gaussian. One popular way to parameterize $Q$ is to make it also Gaussian with mean $\mu\left(X\right)$ and diagonal covariance $\sigma\left(X\right)I$, i.e. $Q\left(z\vert X\right) = \mathcal{N}\left(z;\mu\left(X\right), \sigma\left(X\right)I\right)$, where $\mu\left(X\right)$ and $\sigma\left(X\right)$ are two vectors computed by a neural network. This is the original formulation of VAEs in section 3 of this paper.

This parameterization is preferred because the KL divergence now becomes closed-form:

$\displaystyle \mathcal{D}\left[\mathcal{N}\left(\mu\left(X\right), \sigma\left(X\right)I\right)\vert\vert P\left(z\right)\right] = \frac{1}{2}\left[\left(\sigma\left(X\right)\right)^T\left(\sigma\left(X\right)\right) +\left(\mu\left(X\right)\right)^T\left(\mu\left(X\right)\right) - k - \log \det \left(\sigma\left(X\right)I\right) \right]$

Although this looks like magic, but it is quite natural if you apply the definition of KL divergence on two normal distributions. Doing so will teach you a bit of calculus.

So we have all the ingredients. We use a feedforward net to predict $\mu\left(X\right)$ and $\sigma\left(X\right)$ given an input sample $X$ draw from the training set. With those vectors, we can compute the KL divergence and $\log P\left(X\vert z\right)$, which, in term of optimization, will translate into something similar to $\Vert X - f\left(z\right)\Vert^2$.

It is worth to pause here for a moment and see what we just did. Basically we used a constrained Gaussian (with diagonal covariance matrix) to parameterize $Q$. Moreover, by using $\Vert X - f\left(z\right)\Vert^2$ for one of the training criteria, we implicitly assume $P\left(X\vert z\right)$ to be also Gaussian. So although the maths that lead to VAEs are generic and beautiful, at the end of the day, to make things tractable, we ended up using those severe approximations. Whether those approximations are good enough totally depend on practical applications.

There is an important detail though. Once we have $\mu\left(X\right)$ and $\sigma\left(X\right)$ from the encoder, we will need to sample $z$ from a Gaussian distribution parameterized by those vectors. $z$ is needed for the decoder to reconstruct $\hat{X}$, which will then be optimized to be as close to $X$ as possible via gradient descent. Unfortunately, the “sample” step is not differentiable, therefore we will need a trick call reparameterization, where we don’t sample $z$ directly from $\mathcal{N}\left(\mu\left(X\right), \sigma\left(X\right)\right)$, but first sample $z'$ from $\mathcal{N}\left(0, I\right)$, and then compute $z = \mu\left(X\right) + \mu\left(X\right)Iz'$. This will make the whole computation differentiable and we can apply gradient descent as usual.

The cool thing is during inference, you won’t need the encoder to compute $\mu\left(X\right)$ and $\sigma\left(X\right)$ at all! Remember that during training, we try to pull $Q$ to be close to $P\left(z\right)$ (which is standard normal), so during inference, we can just inject $\epsilon \sim \mathcal{N}\left(0, I\right)$ directly into the decoder and get a sample of $X$. This is how we can leverage the power of “generation” from VAEs.

There are various extensions to VAEs like Conditional VAEs and so on, but once you understand the basic, everything else is just nuts and bolts.

To sum up the series, this is the conceptual graph of VAEs during training, compared to some other models. Of course there are many details in those graphs that are left out, but you should get a rough idea about how they work.

In the case of VAEs, I added the additional cost term in blue to highlight it. The cost term for other models, except GANs, are the usual L2 norm $\Vert X - \hat{X}\Vert^2$.

GSN is an extension to Denoising Autoencoder with explicit hidden variables, however that requires to form a fairly complicated Markov Chain. We may have another post  for it.

With this diagram, hopefully you will see how lame GAN is. It is even simpler than the humble RBM. However, the simplicity of GANs makes it so powerful, while the complexity of VAE makes it quite an effort just to understand. Moreover, VAEs make quite a few severe approximation, which might explain why samples generated from VAEs are far less realistic than those from GANs.

That’s quite enough for now. Next time we will switch to another topic I’ve been looking into recently.

This is the slides I used for a talk I did recently in our reading group. The slides, particularly the Attention part, was based on one of Quoc Le’s talks on the same topic. I couldn’t come up with any better visual than what he did.

It has been quite a while since the last time I look at this topic, unfortunately I never managed to fully anticipate its beauty. Seq2Seq is one of those simple-ideas-that-actually-work in Deep Learning, which opened up a whole lot of possibilities and enabled many interesting work in the field.

A friend of mine did Variational Inference for his PhD, and once he said Variational Inference is one of those mathematically-beautiful-but-don’t-work things in Machine Learning.

Indeed, there are stuff like Variational, Bayesian inference, Sum-Product Nets etc… that come with beautiful mathematical frameworks, but don’t really work at scale, and stuff like Convolutional nets, GANs, etc.. that are a bit slippery in their mathematical foundation, often empirically discovered, but work really well in practice.

So even though many people might not really like the idea of GANs, for example, but given this “empirical tradition” in Deep Learning literature, probably they are here to stay.

# Self-driving cars, again

This is my second take on Self-driving cars, a bit more serious than last time. You might be surprised to know that it is a combination of many old-school stuff in Computer Vision and Machine Learning like Perspective Transform, thresholding, Image warping,  sliding windows, HoG, linear SVM, etc…

Three months ago I kept wondering how would Self-driving cars work in Vietnam.

Now I am certain that it will never work, at least for the next 20 years (in Vietnam or in India, for that matter).

# Variational Autoencoders 1: Overview

In a previous post, we briefly mentioned some recent approaches for Generative Modeling. Among those, RBMs and DBMs are probably very tricky because the estimation of gradients in those models is based on a good mixing of MCMC, which tends to get worse during the course of training because the model distribution gets sharper. Autogressive models like PixelRNN, WaveNet, etc… are easier to train but have no latent variables, which makes them somewhat less powerful. Therefore, the current frontier in Generative Modelling is probably GANs and Variational Autoencoders (VAEs).

While GANs are too mainstream, I thought I can probably write a post or two about Variational Autoencoders, at least to clear up some confusions I am having about them.

Formally, generative modeling is the area in Machine Learning that deals with models of distributions $P(X)$, defined over datapoints $X$ in some high-dimensional space $\mathcal{X}$. The whole idea is to construct models of $P(X)$ that assigns high probabilities to data points similar to those in the training set, and low probabilities every where else. For example, a generative models of images of cows should assign small probabilities to images of human.

However, computing the probability of a given example is not the most exciting thing about generative models. More often, we want to use the model to generate new samples that look like those in the training set. This “creativity” is something unique to generative models, and does not exist in, for instance, discriminative models. More formally, say we have a training set sampled from an unknown distribution $P_\text{org}(X)$, and we want to train a model $P$ which we can take sample from, such that $P$ is as close as possible to $P_\text{org}$.

Needless to say, this is a difficult problem. To make it tractable, traditional approaches in Machine Learning often have to 1) make strong assumptions about the structure of the data, or 2) make severe approximation, leading to suboptimal models, or 3) rely on expensive sampling procedures like MCMC. Those are all limitations which make Generative modeling a long-standing problem in ML research.

Without further ado, let’s get to the point. When $\mathcal{X}$ is a high-dimensional space, modeling is difficult mostly because it is tricky to handle the inter-dependencies between dimensions. For instance, if the left half of an image is a horse then probably the right half is likely another horse.

To further reduce this complexity, we add a latent variable $z$ in a high-dimensional space $\mathcal{Z}$ that we can easily sample from, according to a distribution $P(z)$ defined over $\mathcal{Z}$. Then say we have a family of deterministic function $f(z;\theta)$ parameterized by a vector $\theta$ in some space $\Theta$ where $f: \mathcal{Z} \times \Theta \rightarrow \mathcal{X}$. Now $f$ is deterministic, but since $z$ is a random variable, $f(z;\theta)$ is a random variable in $\mathcal{X}$.

During inference, we will sample $z$ from $P(z)$, and then train $\theta$ such that $f(z;\theta)$ is close to samples in the training set. Mathematically, we want to maximize the following probability for every sample $X$ in the training set:

$\displaystyle P(X) = \int P\left(X\vert z; \theta\right)P(z)dz$   (1)

This is the good old maximum likelihood framework, but we replace $f(z;\theta)$ by $P\left(X\vert z;\theta\right)$ (called the output distribution) to explicitly indicate that $X$ depends on $z$, so that we can use the integral to make it a proper probability distribution.

There are a few things to note here:

• In VAEs, the choice of the output distribution is often Gaussian, i.e. $P\left(X\vert z;\theta\right) = \mathcal{N}\left(X; f(z;\theta), \sigma^2 * I\right)$, meaning it is a Gaussian distribution with mean $f(z;\theta)$ and diagonal covariance matrix where $\sigma$ is a scalar hyper-parameter. This particular choice has some important motivations:
• We need the output distribution to be continuous, so that we can use gradient descent on the whole model. It wouldn’t be possible if we use discontinuous function like the Dirac distribution, meaning to use exactly the output value of $f(z;\theta)$ for $X$.
• We don’t really need to train our model such that $f(z;\theta)$ produces exactly some sample X in the training set. Instead, we want it to produce samples that are merely like X. In the beginning of training, there is no way for $f$ to gives exact samples in the training set. Hence by using a Gaussian, we allow the model to gradually (and gently) learn to produce samples that are more and more like those in the training set.
• It doesn’t have to be Gaussian though. For instance, if $X$ is binary, we can make $P\left(X\vert z;\theta\right)$ a Bernoulli parameterized by $f(z;\theta)$. The important property is that $P\left(X\vert z\right)$ can be computed, and is continuous in the domain of $\theta$.
• The distribution of $z$ is simply the normal distribution, i.e. $P(z) = \mathcal{N}\left(0,I\right)$. Why? How is it possible? Is there any limitation with this? A related question is why don’t we have several levels of latent variables. which potentially might help modelling complicated processes?
All those question can be answered by the key observation that any distribution in $d$ dimensions can be generated by taking $d$ variables from the normal distribution and mapping them through a sufficiently complicated function.
Let that sink for a moment. Readers who are interested in the mathematical details can have a look at the conditional distribution method described in this paper. You can also convince yourself if you remember how we can sample from any Gaussian as described in an earlier post.
Now, this observation means we don’t need to go to more than one level of latent variable, with a condition that we need a sufficiently complicated function for $f(z;\theta)$. Since deep neural nets has been shown to be a powerful function approximator, it makes a lot of sense to use deep neural nets for modeling $f$.
• Now the only business is to maximize (1). Using the law of large numbers, we can approximate the integral by the expected value over a large number of samples. So the plan will be to take a very large sample $\left\{z_1, ..., z_n\right\}$ from $P(z)$, then compute $P(X) \approx \frac{1}{n}\sum_i P\left(X\vert z_i;\theta\right)$. Unfortunately the plan is infeasible because in high dimensional spaces, $n$ needs to be very large in order to have a good enough approximation of $P(X)$ (imagine how much samples you would need for $200 \times 200 \times 3$ images, which is in 120K dimensional space?)
Now the key to realize is that we don’t need to sample $z$ from all over $P(z)$. In fact, we only need to sample $z$ such that $f(z;\theta)$ is more likely to be similar to samples in the training set. Moreover, it is likely that for most of $z$, $P(X\vert z)$ is nearly zero, and therefore contribute very little into the estimation of $P(X)$. So the question is: is there any way to sample $z$ such that it is likely to generate $X$, and only estimate $P(X)$ from those?
It is the key idea behind VAEs.

That’s quite enough for an overview. Next time we will do some maths and see how we go about maximizing (1). Hopefully I can then convince you that VAEs, GANs and GSNs are really not far away from each other, at least in their core ideas.

So I read the Learning to learn paper a while ago, and I was surprised that the Decoupled Neural Interfaces paper didn’t cite them. For me the ideas are pretty close, where you try to predict the gradient used in each step of gradient descent, instead of computing it by backpropagation. Taking into account that they are all from DeepMind, won’t it be nice to cite each other and increase the impact factors for both of them?

Nevertheless, I enjoyed the paper. The key idea is instead of doing a normal update $\theta_{t+1} = \theta_{t} - \alpha_t \nabla f\left(\theta_t\right)$, we do it as $\theta_{t+1} = \theta_{t} + g_t\left(\nabla f\left(\theta_t\right), \phi\right)$ where $g_t$ is some function parameterized by $\phi$.

Now one can use any function approximator for $g_t$ (called optimizer, to differentiate with $f\left(\theta\right)$ – the optimizee), but using RNNs has a particular interesting intuition as we hope that the RNNs can remember the gradient history and mimic the behavior of, for instance, momentum.

The convenient thing about this framework is that the objective function for training the optimizer is the expected weighted sum of the output of the optimizee $f\left(\theta\right)$. Apart from this main idea, everything else is nuts and bolts, which of course are equivalently important.

The first obstacle that they had to solve is how to deal with big models of perhaps millions of parameters. In such cases, $g_t$ has to input and output vector of millions of dimensions. Instead, the authors solved this problem very nicely by only working with one parameter at a time, i.e. the optimizer only takes as input one element of the gradient vector and output the update for that element. However, since the optimizer is a LSTM, the state of the gradient coordinates are maintained separately. This also has a nice side effect that  it reduces the size of the optimizer, and you can potentially re-use the optimizer for different optimizees.

The next two nuts and bolts are not so obvious. To mimic the L2 gradient clipping trick, they used the so-called global averaging cell (GAC), where the outgoing activations of LSTM cells are averaged at each step across all coordinates. To mimic Hessian-based optimization algorithms, they wire the LSTM optimizer with an external memory unit, hoping that the optimizer will learn to store the second-order derivatives in the memory.

Although the experimental results look pretty promising, many people pose some doubts about the whole idea of learning to learn. I was in the panel discussion of Learning to learn at NIPS, and it wasn’t particularly fruitful (people were drinking sangria all the time). It will be interesting to see the follow-ups on this line of work, if there is any.

# What Goes Around Comes Around: Databases, Big Data and the like

Over the years, I have the privilege of working with some pretty damn good people. One of those guys is a PhD in Database Research, used to be a professor, but apparently so passionate and so good at teaching that he eventually quits academia to join the industry.

He did an PhD in XML database, and even though XML database is completely useless, it doesn’t mean a PhD in XML Database couldn’t teach you anything (in fact, a good PhD could teach you quite many useful things). One interesting thing I learned from him was the evolution of database technology, which originates from an essay by Michael Stonebraker called What goes around comes around.

Michael Stonebraker is a big name in Database Research and has been around in Database research for a good 35 years, so you could easily expect him to be highly opinionated on a variety of things. The first few lines in the essay read like this:

In addition, we present the lessons learned from the exploration of the proposals in each era. Most current researchers were not around for many of the previous eras, and have limited (if any) understanding of what was previously learned. There is an old adage that he who does not understand history is condemned to repeat it. By presenting “ancient history”, we hope to allow future researchers to avoid replaying history.

Unfortunately, the main proposal in the current XML era bears a striking resemblance to the CODASYL proposal from the early 1970’s, which failed because of its complexity. Hence, the current era is replaying history, and “what goes around comes around”. Hopefully the next era will be smarter.

His work, among others, include PostgreSQL. Hence, after reading the essay, it becomes obvious to me why there are so many highly opinionated design decisions being implemented in Postgres.

The essay is a very good read. You get to see how database technologies evolved from naive data models to unnecessarily complicated models, then thanks to a good mathematician named Edgar F. Codd, it got way more beautiful and highly theoretically-grounded. After a few alternatives, a new wave of XML databases come back bearing a lot of complications. (Along the way, you also get to see how Michael Stonebraker managed to sell several startups, but that isn’t the main story – or is it?)

There are many interesting lesson learned. The most two interesting for me are:

• XML database doesn’t take off because it is exceedingly complicated, and there is no way to efficiently store and index it using our best data structures like B-trees and the like.
I learned XML databases and I was told that XML databases failed because it lacks a theoretical foundation like the Relational model. Now looking back, I think that isn’t the main issue. The problem with XML is that it allows too much flexibility in the language, so implementing a good query optimizer for it is extremely difficult.
A bit more ironically, this is how Michael Stonebraker puts it:

We close with one final cynical note. A couple of years ago OLE-DB was being pushed hard by Microsoft; now it is “X stuff”. OLE-DB was pushed by Microsoft, in large part, because it did not control ODBC and perceived a competitive advantage in OLE-DB. Now Microsoft perceives a big threat from Java and its various cross platform extensions, such as J2EE. Hence, it is pushing hard on the XML and Soap front to try to blunt the success of Java

It sounds very reasonable to me. At some point around 2000-2010, I remember hearing things like XML is everywhere in Windows. It has to be someone like Microsoft keeps pushing it hard to make it quite phenomenal. When Microsoft started the .NET effort to directly compete with Java, the XML database movement faded away.
One thing Michael Stonebraker got it wrong though. In the essay, he said XML (and SOAP) is gonna be the data exchange format of the future, but it turns out XML is still overly complicated for this purpose, and people ended up with JSON and RESTful instead.

• The whole competitive advantage of PostgreSQL was about UDTs and UDFs, a somewhat generalization of stored procedures. Although stored procedures are soon out-of-fashion because people realize it is difficult to maintain your code in multiple places, both in application code and store procedures in DBMS. However, the idea of bringing code close to data (instead of bringing data to code) is a good one, and has a big consequence on the Big Data movement.

Speaking of Big Data, Stonebraker must have something to say about it. For anyone who is in Big Data, you should really see this if you haven’t:

The talk presents a highly organized view on various aspects of Big Data and how people solved them (and of course mentions a few startups founded by our very Michael Stonebraker).

He mentioned Spark at some point. If you look at Spark nowadays, it’s nothing more than an in-memory distributed SQL engine (for traditional business intelligence queries), along with a pretty good Machine Learning library (for advanced analytics). From a database point of view, Spark looks like a toy: you can’t share tables, tables don’t have indices, etc… but the key idea is still there: you bring computation to the data.

Of course I don’t think Spark wants to become a database eventually, so I am not sure if Spark plans to fix those issues at all, but adding catalog (i.e. data schema), and supporting a somewhat full-fledged SQL engine were pretty wise decisions.

There are several other good insights about the Big Data ecosystem as well: why MapReduce sucks, what are other approaches to solve the Big Volume problem (besides Spark), how to solve the Big Velocity problem with streaming, SQL, NoSQL and NewSQL, why the hardest problem in Big Data is Variety, etc…  I should’ve written a better summary of those, but you could simply check it out.