Kalman filters (and how they relate to HMMs)

Kalman filters are insanely popular in many engineering fields, especially those involve sensors and motion tracking. Consider how to design a radar system to track military aircrafts (or warships, submarines, … for that matter), how to track people or vehicles in a video stream, how to predict location of a vehicle carrying a GPS sensor… In all these cases, some (advanced) variation of Kalman filter is probably what you would need.

Learning and teaching Kalman filters is therefore quite challenging, not only because of the mere complexity of the algorithms, but also because there are many variations of them.

With a Computer Science background, I encountered Kalman filters several years ago, but never managed to put them into the global context of the field. I had chances to look at them again recently, and rediscovered yet another way to present and explain Kalman filters. It made a lot of sense to me, and hopefully it does to you too.

Note that there are a lot of details missing from this post (if you are building a radar system to track military aircrafts, look somewhere else!). I was just unhappy to see many introductory material on Kalman filters are either too engineering or too simplified. I want something more Machine Learning-friendly, so this is my attempt.

Let’s say you want to track an enemy’s aircraft. All you have is a lame radar (bought from Russia probably) which, when oriented properly, will give you a 3-tuple of range, angle and angular velocity [r \;\phi\;\dot{\phi}]^{T} of the aircraft. This vector is called the observation \mathbf{z}_k (subscript k because it depends on time). The actual position of the aircraft, though, is a vector in cartesian coordinates \mathbf{x}_k = [x_1\;x_2\;x_3]^{T}. Since it is an enemy’s aircraft, you can only observe \mathbf{z}_k, and you want to track the state vector \mathbf{x}_k over time, every time you receive a measurement \mathbf{z}_k from the radar.

Visualised as a Bayesian network, it looks like this:

Untitled Diagram (1)

With all the Markov properties hold, i.e. \mathbf{x}_k only depends on \mathbf{x}_{k-1} and \mathbf{z}_k only depends on \mathbf{x}_k, does this look familiar?

Yes, indeed. This looks a lot like the beloved Hidden Markov Models, where p\left(\mathbf{x}_k\vert\mathbf{x}_{k-1}\right) is called the transition model, and p\left(\mathbf{z}_k\vert\mathbf{x}_k\right) is called the observation model.

However, in HMMs, the states \mathbf{x}_k are discrete variables whose values can only be taken from a fixed set of states. In our case, the state vector \mathbf{x}_k is continuous, therefore HMM algorithms cannot be used in this setting.

1. The tracking problem

Before we see how we can solve this problem, let’s first formally define it. Let \mathbf{Z}_{1:k} = \left\{\mathbf{z}_1, ..., \mathbf{z}_k\right\} be the set of observations up to timestep t_k.

The whole point of tracking is to infer the value of the state at the current timestep \mathbf{x}_k, given the sequence of observation \mathbf{Z}_{1:k}. In other words, being a good Bayesian, that is to keep track the conditional probability distribution p\left(\mathbf{x}_k\vert\mathbf{Z}_{1:k}\right).

Let’s assume somehow we know the distribution p\left(\mathbf{x}_{k-1}\vert\mathbf{Z}_{1:k-1}\right). Now at timestep t_k, we receive the observation \mathbf{z}_k, and want to infer the distribution p\left(\mathbf{x}_k\vert\mathbf{Z}_{1:k}\right). How do we go about this?

Inspired by some similar idea in the field (e.g. the EM algorithm), we can do this in 2 step. The first step is to predict the value of \mathbf{x}_k  given the previous history \mathbf{Z}_{1:k-1}. Using the law of total probability for the conditional distribution of continuous variables, we can write:

\begin{array}{rl} p\left(\mathbf{x}_k\vert \mathbf{Z}_{1:k-1}\right)& = \int p\left(\mathbf{x}_k\vert\mathbf{x}_{k-1}, \mathbf{Z}_{1:k-1}\right)p\left(\mathbf{x}_{k-1}\vert\mathbf{Z}_{1:k-1}\right)d\mathbf{x}_{k-1} \\  & = \int p\left(\mathbf{x}_k\vert\mathbf{x}_{k-1}\right)p\left(\mathbf{x}_{k-1}\vert\mathbf{Z}_{1:k-1}\right)d\mathbf{x}_{k-1}\end{array}

The last equation was based on the Markov assumption that \mathbf{x}_k is independent of everything else given \mathbf{x}_{k-1}. Now we have all the distribution on the right (which are the transition model and the previous posterior at timestep t_{k-1}), we can just integrate over all possible values of \mathbf{x}_{k-1} to get the distribution on the left!

Assume that is doable (although maybe not trivial), we now have the prior distribution p\left(\mathbf{x}_k\vert \mathbf{Z}_{1:k-1}\right). With the new observation \mathbf{z}_k, using the Bayes rule, we can update our estimation as follow:

 \begin{array}{rl} p\left(\mathbf{x}_k\vert \mathbf{Z}_{1:k}\right) & = \displaystyle \frac{ p\left(\mathbf{z}_k\vert\mathbf{x}_k\right)p\left(\mathbf{x}_k\vert \mathbf{Z}_{1:k-1}\right)}{p\left(\mathbf{z}_k\vert \mathbf{Z}_{1:k-1}\right)} \\  & = \displaystyle \frac{p\left(\mathbf{z}_k\vert\mathbf{x}_k\right)p\left(\mathbf{x}_k\vert \mathbf{Z}_{1:k-1}\right)}{\int p\left(\mathbf{z}_k\vert\mathbf{x'}_k\right)p\left(\mathbf{x'}_k\vert \mathbf{Z}_{1:k-1}\right)d\mathbf{x'}_k}\end{array}

Here we use the observation model and the prior computed in the previous step to estimate the posterior p\left(\mathbf{x}_k\vert \mathbf{Z}_{1:k}\right) at time t_k. With this, we are ready to process the next observation.

Unfortunately, both the predict and update equations are intractable in the general case. We therefore need to find some ways to approximate them.

2. Kalman filter

Enter Kalman filter. In Kalman filter, we make the following assumption:

  • The prior distribution is Gaussian: p\left(\mathbf{x}_{k-1}\vert \mathbf{Z}_{1:k-1}\right)=\mathcal{N}\left(\hat{\mathbf{x}}_{k-1},\mathbf{P}_{k-1}\right), parameterized by the state mean vector \hat{\mathbf{x}}_{k-1} and covariance matrix \mathbf{P}_{k-1}.
    Predict and update the states now mean we need to keep track the value of \hat{\mathbf{x}}_{k-1} and \mathbf{P}_{k-1} over time.
  • The transition model is linear: \mathbf{x}_k = f\left(\mathbf{x}_{k-1}\right) = \mathbf{F}_k\mathbf{x}_{k-1} + \mathbf{B}_k\mathbf{u}_k + \mathbf{q}_k
    where \mathbf{u}_k is the control vector (zero for many application), \mathbf{q}_k \sim \mathcal{N}\left(0, \mathbf{Q}_k\right) is the process noise vector drawn from a zero-mean Gaussian distribution with covariance \mathbf{Q}_k.
    Practically, this means the transition distribution is Gaussian: p\left(\mathbf{x}_k\vert\mathbf{x}_{k-1}\right)=\mathcal{N}\left(\mathbf{F}_k\mathbf{x}_k,\mathbf{Q}_k\right)
    Note that although \mathbf{F}_k, \mathbf{B}_k, \mathbf{Q}_k are all subscripted t, but they don’t need to be kept track over the course of tracking. Often they only depend on the time difference between the latest timestep and the previous one.
  • The observation model is linear: \mathbf{z}_k = h\left(\mathbf{x}_k\right) = \mathbf{H}_k\mathbf{x}_k + \mathbf{r}_k
    where \mathbf{r}_k \sim \mathcal{N}\left(0, \mathbf{R}_k\right) is the observation noise.
    Similarly, this means the observation distribution is Gaussian: p\left(\mathbf{z}_k\vert\mathbf{x}_k\right) = \mathcal{N}\left(\mathbf{H}_k\mathbf{x}_k,\mathbf{R}_k\right)

Those assumption are based on the fact that a linear transformation of a Gaussian is also a Gaussian.

Now since everything is Gaussian, the equations become tractable and have closed-form solutions. The derivation of Kalman filter can be found everywhere on the Internet, and only involves a few matrix operations.

In fact this is why it is so popular. At the end of the day, if the state vector is low-dimensional, it can be implemented very efficiently on low-power devices, while giving good solutions once the assumptions above are satisfied.

3. Extended Kalman Filter and Unscented Kalman Filter

Unfortunately, life is not linear. For instance, the location of an aircraft is almost certainly not a linear function of its location in previous timestep (unless it is flying at a constant velocity, which is the best way to be shot down). For those applications, the assumption of Kalman filter doesn’t hold anymore.

There are a few way to deal with this scenario, where the transition function f\left(\mathbf{x}_{k-1}\right) and the observation function h\left(\mathbf{x}_k\right) are non-linear.

First, Extended Kalman Filter uses the Taylor Series expansion to approximate the function \mathbf{z}_t = h\left(\mathbf{x}_k\right) \approx h\left(\mathbf{x}_{k-1}\right) + \frac{\partial h}{\partial x}\left(\mathbf{x}_{k-1}\right)\left(\mathbf{x}_{k}-\mathbf{x}_{k-1}\right), and this is a linear function again. We then need to compute the Jacobian matrix at every timestep, then apply the vanilla Kalman Filter as usual.

However, computing the Jacobian matrix is messy, and the Taylor Series expansion might not very well-behaved for complicated process. In those case, Unscented Kalman Filter (UKF) might be used.

UKF is based on the observation that we only need to track the mean vector \hat{\mathbf{x}_{k}} and the covariance matrix \mathbf{P}_k over time. So for every non-linear transformation, we can just take a sample of 2n + 1 points (called the sigma points, where n is the dimensionality of the domain) and transform them by our function. We then can estimate the new mean and covariance based on the transformed sigma points.

It looks like this:

Untitled Diagram (2)

The details on how to choose  the sigma points and how to compute the transformed mean and variance can be found on the Internet. We are not going to repeat them here because they are boring, but the idea now gets closer to sampling methods.

We will have another post for Particle Filtering, with all the sampling methods. This is where things get serious.



  1. While the assumption of noise being Gaussian makes the presemtation , I wanted to be sure that your audience understood that it is not at all essential to the workings or success of the (linear) Kalman filter. If non-Gaussian, the estimate is the Minimum Unbiased Variance one, or MVUE, or Bayesian-type estimator. I underscore this because you’ll often see this connection made, that of Gaussinity and Kalman, even in some textbooks, and it’s just not true, so it corrupts the youth. (See Durbin & Koopman, Time Series Analysis by State Space Methods , 2nd edition, 2012, section 4.3.1.)

    The other important thing I believe thr presentation missed is the idea of a smoother , which runs offline and uses accumulated observations to do a backward pass and revise state estimates using future information. This can be important, say, if some observations are missing and there’s a desire to estimate the corresponding states. Indeed, given the presentation, a missing observation would break the process. You can see this in action at my blog .


    1. Thank Jan, for your comment.
      Yes, I know a lot of details are missing, and I didn’t plan to make a full “lecture” on the topic anyway.
      Interesting information on the distribution of the noise, I wasn’t aware of that.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s