# Stein’s identity and Stein Variational Gradient Descent

## TL;DR

If you have an target distribution $p(x)$ that you want to model, but you can’t compute it (maybe because it involves a partition function), then starting with an initial set of particles $\left\{x_i^0\right\}_{i=1}^n$, you can iteratively update them as following:

$\displaystyle x_i^{l+1} \leftarrow x_i^l + \epsilon_l\hat{\phi}^*\left(x_i^l\right)$

where:

• $\epsilon_l$ is a small step size at iteration $l$
• $\displaystyle \hat{\phi}^*\left(x\right) = \frac{1}{n}\sum_{j=1}^n \left[k\left(x_j^l, x\right)\nabla_{x_j^l}\log p\left(x_j^l\right) + \nabla_{x_j^l}k\left(x_j^l, x\right)\right]$ is the steepest update direction
• $k\left( \cdot , \cdot \right)$ is a kernel, typically RBF.

At the end, this update rule gives a set of particle $\left\{x_i\right\}_{i=1}^n$ that approximates the target distribution.

Note that to compute the update direction, you only need to compute the derivative of the log of the (unnormalized) target distribution with respect to a set of samples $\left\{x_j^l\right\}_{j=1}^n$.

This result is significant because it is analogous to gradient descent that minimizes a KL divergence. The set of particles $\left\{x_i^0\right\}_{i=1}^n$, therefore, normally comes from another black-box model.

An interesting intuition is that this direction will push the particles into the regions with high values of $p(x)$, while the second term (derivatives of the kernel), for the case of RBF, has “regularization” effect, which keeps the particles away from each other, prevent them from collapsing into the same mode. This is how the method is probably better than pure MCMC.

## The short story

Let $x$ be a continuous random variables in $\mathcal{X} \subset \mathbb{R}^d$ and $p\left(x\right)$ is the (intractable) target distribution. Let $\phi\left(x\right) = \left[\phi_1\left(x\right), ..., \phi_d\left(x\right)\right]^T$ a smooth vector-valued function, the so-called Stein‘s identity says:

$\displaystyle \mathbb{E}_{x \sim p}\left[\mathcal{A}_p\phi\left(x\right)\right] = 0$

where $\mathcal{A}_p\phi\left(x\right) = \phi\left(x\right)\nabla_x\log p\left(x\right)^T + \nabla_x\phi\left(x\right)$ and $\mathcal{A}_p$ is called the Stein operator.

Let $q\left(x\right)$ be another distribution. Now $\mathbb{E}_{x \sim q}\left[\mathcal{A}_p\phi\left(x\right)\right]$ will no longer be zero. It turns out this can be used to define the discrepancy between two distributions p and q:

$\displaystyle \mathbb{S}\left(q, p\right) = \max_\phi \left[\mathbb{E}_{x \sim q}\text{trace}\left(\mathcal{A}_p\phi\left(x\right)\right)\right]^2$

Meaning we consider all possible smooth function $\phi$ and use the one that maximizes the violation of the Stein’s identity. This maximum violation is defined to be the Stein discrepancy between q and p.

Considering all possible $\phi$ is impractical. But it turns out if we consider $\phi$ in the unit ball of a reproducing kernel Hilbert space (RKHS) $\mathcal{H}^d$, then the kernelized Stein discrepancy is defined as:

$\displaystyle \mathbb{KS}\left(q, p\right) = \max_{\phi\in\mathcal{H}^d} \left[\mathbb{E}_{x \sim q}\text{trace}\left(\mathcal{A}_p\phi\left(x\right)\right)\right]^2\quad \text{s.t.} \parallel \phi\parallel_{\mathcal{H}^d} \leq 1$

and this has a closed-form solution: $\displaystyle \phi_{q,p}^*\left(\cdot\right) = \mathbb{E}_{x \sim q}\left[\mathcal{A}_pk\left(x,\cdot\right)\right]$.

Now if we take $T\left(x\right) = x + \epsilon\phi\left(x\right)$ and $q_{[T]}$ is the distribution of $T$ when $x \sim q(x)$, then [1] shows that the derivatives of the KL divergence between $q_{[T]}$ and p has an interesting form:

$\nabla_\epsilon KL\left(q_{[T]} \parallel p\right)\vert_{\epsilon = 0} = -\mathbb{E}_{x \sim q}\left[\text{trace}\left(\mathcal{A}_p\phi\left(x\right)\right)\right]$

(without the square)

Relating to all the above, we now have the direction that minimizes the KL divergence is the expectation of the Stein operator:

$\displaystyle \phi_{q,p}^*\left(\cdot\right) = \mathbb{E}_{x \sim q}\left[k(x,\cdot)\nabla_x\log p(x) + \nabla_x k(x,\cdot)\right]$

## The long story

The above is a huge simplification, don’t take it too seriously. If you are dying for more details, the following might help:

[1] https://arxiv.org/abs/1602.03253 introduce the kernelized Stein discrepancy
[2] https://arxiv.org/abs/1608.04471 presents the SVGD algorithm
[3] https://arxiv.org/abs/1611.01722 uses the algorithm to estimate parameters of an energy-based deep neural net.

I learned about the Simpson’s paradox fairly recently, and I found it quite disturbing, not because of the mere “paradox” itself, but mainly because I felt it was something I should have known already.

In case you haven’t heard about it, one instance of the paradox is a real-world medical study for comparing the success rate of two treatments for kidney stones (from Wikipedia):

Overall, Treatment B is better because its success rate is 83%, compared to 78% of Treatment A. However, when they split the patients into 2 groups: those with small stones and those with large stones, then Treatment A is better than Treatment B in both subgroups. Paradoxical enough?

Well, it’s not. It turns out that for severe cases (large stones), doctors tend to give the more effective Treatment A, while for milder cases with small stones, they tend to give the inferior Treatment B. Therefore the sum is dominant by group 2 and group 3, while the other groups contribute little to the final sums. So the results can be interpreted more accurately as: when Treatment B is more frequently applied to less severe cases, it can appear to be more effective.

Now, knowing that Treatment and Stone size are not independent, this should not come up as a paradox. In fact, we can visualize the problem as a graphical model like this

All the numbers in the table above can be expressed as conditional probabilities like so:

• Group 1: $p\left(S=true \vert T=A, St=small\right) = 0.93$
• Group 2: $p\left(S=true \vert T=B, St=small\right) = 0.87$
• Group 3: $p\left(S=true \vert T=A, St=large\right) = 0.73$
• Group 4: $p\left(S=true \vert T=B, St=large\right) = 0.69$
• $p\left(S=true \vert T=A\right) = 0.78$
• $p\left(S=true \vert T=B\right) = 0.83$

For any of us who studied Probability, it is no surprise that the probabilities might turn up-side-down whenever some conditional variables are stripped out of the equations. In this particular case, since S depends on both St and T, the last 2 equations do not bring any new knowledge about S.

So what is this “paradox” about? Isn’t it nothing more than the problem of confounding/lurking variables, something that most people in Probability/Statistics already known? In this particular case, Stone size is the lurking variable that dictates both Treatment and Success, therefore the scientists who designed the experiment should have taken it into account since the beginning. It is well-known among Statistic practitioners that they must try their best to identify and eliminate the effect of any lurking variables in their experiments, or at least keep them fixed, before drawing any meaningful conclusion.

From a slightly different perspective, the paradox can be understood once we understand the human bias of drawing causal relations. Human, perhaps for the sake of survival, constantly look for causal relations and often tend to ignore rates or proportions. Once we conceived something as being causal (Treatment B gives higher success rate than Treatment A in general), which might be wrong, we continue to assume a causal relation and proceed with that assumption in mind. Obviously with this assumption, we will find the success rates for the subgroups of patients to be highly counter-intuitive, or even paradoxical.

In fact, the connection of this paradox to human intuitions is so important that Judea Pearl dedicated a whole section in his book for it. Modern Statistical textbooks and curriculum, however, don’t even mention it. Instead they will generally present the topic along with lurking/confounding variables.

Therefore, if you haven’t heard about this, it is probably for a good reason, or perhaps you are simply too young.

# [RL4a] Policy Optimization

I thought I would write about some theory behind Reinforcement Learning, with eligibility traces, contraction mapping, POMDP and so on, but then I realized if I go down that rabbit hole, I would probably never finish this. So here are some practical stuff that people are actually using these days.

Two main approaches in model-free RL are policy-based, where we build a model of the policy function, and value-based, where we learn the value function. For many practical problems, the structure of the (unknown) policy function might be easier to learn than the (unknown) value function, in which case it makes more sense to use policy-based methods. Moreover, with policy-based methods, we will have an explicit policy, which is the ultimate goal of Reinforcement learning to control (the other type being learn-to-predict, more on that some other time). With value-based methods, like DQN, we will need to do an additional inference step to get the optimal policy.

The hybrid of policy-based and value-based is called Actor-Critic methods, which hopefully will be mentioned in another post.

One of the most straightforward approach in policy-based RL is, unsurprisingly, evolutionary algorithms. In this approach, a population of policies is maintained and evolutionized over time. People show that this works pretty well, e.g. for the Tetris game. However, due to the randomness, this is apparently only applied to problems where the number of parameters of the policy function is small.

A big part of policy-based methods, however, is based on Policy Gradient, where an exact estimate of the gradient of the expected future reward can be computed. Roughly speaking, there is an exact formulation for the gradient of the policy, which we can then use to optimize the policy. Since Deep Learning people basically worship gradients (pun intended), this method suites very well and became trendy about 2 years ago.

So, what is it all about? It is based on a very simple trick called Likelihood Ratio.

# Kalman filters (and how they relate to HMMs)

Kalman filters are insanely popular in many engineering fields, especially those involve sensors and motion tracking. Consider how to design a radar system to track military aircrafts (or warships, submarines, … for that matter), how to track people or vehicles in a video stream, how to predict location of a vehicle carrying a GPS sensor… In all these cases, some (advanced) variation of Kalman filter is probably what you would need.

Learning and teaching Kalman filters is therefore quite challenging, not only because of the mere complexity of the algorithms, but also because there are many variations of them.

With a Computer Science background, I encountered Kalman filters several years ago, but never managed to put them into the global context of the field. I had chances to look at them again recently, and rediscovered yet another way to present and explain Kalman filters. It made a lot of sense to me, and hopefully it does to you too.

Note that there are a lot of details missing from this post (if you are building a radar system to track military aircrafts, look somewhere else!). I was just unhappy to see many introductory material on Kalman filters are either too engineering or too simplified. I want something more Machine Learning-friendly, so this is my attempt.

Let’s say you want to track an enemy’s aircraft. All you have is a lame radar (bought from Russia probably) which, when oriented properly, will give you a 3-tuple of range, angle and angular velocity $[r \;\phi\;\dot{\phi}]^{T}$ of the aircraft. This vector is called the observation $\mathbf{z}_k$ (subscript $k$ because it depends on time). The actual position of the aircraft, though, is a vector in cartesian coordinates $\mathbf{x}_k = [x_1\;x_2\;x_3]^{T}$. Since it is an enemy’s aircraft, you can only observe $\mathbf{z}_k$, and you want to track the state vector $\mathbf{x}_k$ over time, every time you receive a measurement $\mathbf{z}_k$ from the radar.

Visualised as a Bayesian network, it looks like this:

With all the Markov properties hold, i.e. $\mathbf{x}_k$ only depends on $\mathbf{x}_{k-1}$ and $\mathbf{z}_k$ only depends on $\mathbf{x}_k$, does this look familiar?

# Variational Autoencoders 3: Training, Inference and comparison with other models

Variational Autoencoders 1: Overview
Variational Autoencoders 2: Maths
Variational Autoencoders 3: Training, Inference and comparison with other models

Recalling that the backbone of VAEs is the following equation:

$\log P\left(X\right) - \mathcal{D}\left[Q\left(z\vert X\right)\vert\vert P\left(z\vert X\right)\right] = E_{z\sim Q}\left[\log P\left(X\vert z\right)\right] - \mathcal{D}\left[Q\left(z\vert X\right) \vert\vert P\left(z\right)\right]$

In order to use gradient descent for the right hand side, we need a tractable way to compute it:

• The first part $E_{z\sim Q}\left[\log P\left(X\vert z\right)\right]$ is tricky, because that requires passing multiple samples of $z$ through $f$ in order to have a good approximation for the expectation (and this is expensive). However, we can just take one sample of $z$, then pass it through $f$ and use it as an estimation for $E_{z\sim Q}\left[\log P\left(X\vert z\right)\right]$ . Eventually we are doing stochastic gradient descent over different sample $X$ in the training set anyway.
• The second part $\mathcal{D}\left[Q\left(z\vert X\right) \vert\vert P\left(z\right)\right]$ is even more tricky. By design, we fix $P\left(z\right)$ to be the standard normal distribution $\mathcal{N}\left(0,I\right)$ (read part 1 to know why). Therefore, we need a way to parameterize $Q\left(z\vert X\right)$ so that the KL divergence is tractable.

Here comes perhaps the most important approximation of VAEs. Since $P\left(z\right)$ is standard Gaussian, it is convenient to have $Q\left(z\vert X\right)$ also Gaussian. One popular way to parameterize $Q$ is to make it also Gaussian with mean $\mu\left(X\right)$ and diagonal covariance $\sigma\left(X\right)I$, i.e. $Q\left(z\vert X\right) = \mathcal{N}\left(z;\mu\left(X\right), \sigma\left(X\right)I\right)$, where $\mu\left(X\right)$ and $\sigma\left(X\right)$ are two vectors computed by a neural network. This is the original formulation of VAEs in section 3 of this paper.

This parameterization is preferred because the KL divergence now becomes closed-form:

$\displaystyle \mathcal{D}\left[\mathcal{N}\left(\mu\left(X\right), \sigma\left(X\right)I\right)\vert\vert P\left(z\right)\right] = \frac{1}{2}\left[\left(\sigma\left(X\right)\right)^T\left(\sigma\left(X\right)\right) +\left(\mu\left(X\right)\right)^T\left(\mu\left(X\right)\right) - k - \log \det \left(\sigma\left(X\right)I\right) \right]$

Although this looks like magic, but it is quite natural if you apply the definition of KL divergence on two normal distributions. Doing so will teach you a bit of calculus.

So we have all the ingredients. We use a feedforward net to predict $\mu\left(X\right)$ and $\sigma\left(X\right)$ given an input sample $X$ draw from the training set. With those vectors, we can compute the KL divergence and $\log P\left(X\vert z\right)$, which, in term of optimization, will translate into something similar to $\Vert X - f\left(z\right)\Vert^2$.

It is worth to pause here for a moment and see what we just did. Basically we used a constrained Gaussian (with diagonal covariance matrix) to parameterize $Q$. Moreover, by using $\Vert X - f\left(z\right)\Vert^2$ for one of the training criteria, we implicitly assume $P\left(X\vert z\right)$ to be also Gaussian. So although the maths that lead to VAEs are generic and beautiful, at the end of the day, to make things tractable, we ended up using those severe approximations. Whether those approximations are good enough totally depend on practical applications.

There is an important detail though. Once we have $\mu\left(X\right)$ and $\sigma\left(X\right)$ from the encoder, we will need to sample $z$ from a Gaussian distribution parameterized by those vectors. $z$ is needed for the decoder to reconstruct $\hat{X}$, which will then be optimized to be as close to $X$ as possible via gradient descent. Unfortunately, the “sample” step is not differentiable, therefore we will need a trick call reparameterization, where we don’t sample $z$ directly from $\mathcal{N}\left(\mu\left(X\right), \sigma\left(X\right)\right)$, but first sample $z'$ from $\mathcal{N}\left(0, I\right)$, and then compute $z = \mu\left(X\right) + \mu\left(X\right)Iz'$. This will make the whole computation differentiable and we can apply gradient descent as usual.

The cool thing is during inference, you won’t need the encoder to compute $\mu\left(X\right)$ and $\sigma\left(X\right)$ at all! Remember that during training, we try to pull $Q$ to be close to $P\left(z\right)$ (which is standard normal), so during inference, we can just inject $\epsilon \sim \mathcal{N}\left(0, I\right)$ directly into the decoder and get a sample of $X$. This is how we can leverage the power of “generation” from VAEs.

There are various extensions to VAEs like Conditional VAEs and so on, but once you understand the basic, everything else is just nuts and bolts.

To sum up the series, this is the conceptual graph of VAEs during training, compared to some other models. Of course there are many details in those graphs that are left out, but you should get a rough idea about how they work.

In the case of VAEs, I added the additional cost term in blue to highlight it. The cost term for other models, except GANs, are the usual L2 norm $\Vert X - \hat{X}\Vert^2$.

GSN is an extension to Denoising Autoencoder with explicit hidden variables, however that requires to form a fairly complicated Markov Chain. We may have another post  for it.

With this diagram, hopefully you will see how lame GAN is. It is even simpler than the humble RBM. However, the simplicity of GANs makes it so powerful, while the complexity of VAE makes it quite an effort just to understand. Moreover, VAEs make quite a few severe approximation, which might explain why samples generated from VAEs are far less realistic than those from GANs.

That’s quite enough for now. Next time we will switch to another topic I’ve been looking into recently.

# Variational Autoencoders 2: Maths

Variational Autoencoders 1: Overview
Variational Autoencoders 2: Maths
Variational Autoencoders 3: Training, Inference and comparison with other models

Last time we saw the probability distribution of $X$ with a latent variable $z$ as follows:

$\displaystyle P(X) = \int P\left(X\vert z; \theta\right)P(z)dz$  (1)

and we said the key idea behind VAEs is to not sample $z$ from the whole distribution $P\left(z\right)$, but actually from a simpler distribution $Q\left(z\vert X\right)$. The reason is because most of $z$ will likely to give $P\left(X\vert z\right)$ close to zero, and therefore making little contribution to the estimation of $P\left(X\right)$. Now if we sample $z \sim Q\left(z\vert X\right)$, those values of $z$ will more likely to generate $X$ in the training set. Moreover, we hope that $Q$ will has less modes than $P\left(z\right)$, and therefore easier to sample from. The intuition of this is the locations of the modes of $Q\left(z\vert X\right)$ depends on $X$, and this flexibility will compensate the limitation of the fact that $Q\left(z\vert X\right)$ is simpler than $P\left(z\right)$.

But how $Q\left(z\vert X\right)$ can help with modelling $P\left(X\right)$? If $z$ is sampled from $Q$, then using $f$ we will get $E_{z \sim Q}P\left(X\vert z\right)$. We will then need to show the relationship of this quantity with $P\left(X\right)$, which is the actual quantity we want to estimate. The relationship between $E_{z \sim Q}P\left(X\vert z\right)$ and $P\left(X\right)$ is the backbone of VAEs.

We start with the KL divergence of $Q\left(z\vert X\right)$ and $P\left(z\vert X\right)$:

$\mathcal{D}\left[Q\left(z\vert X\right) \vert\vert P\left(z\vert X\right)\right] = E_{z\sim Q}\left[\log Q\left(z\vert X\right) - log P\left(z\vert X\right)\right]$

The unknown quantity in this equation is $P\left(z\vert X\right)$, but at least we can use Bayes rule for it:

$\mathcal{D}\left[Q\left(z\vert X\right) \vert\vert P\left(z\vert X\right)\right] = E_{z\sim Q}\left[\log Q\left(z\vert X\right) - log P\left(X\vert z\right) - \log P\left(z\right)\right] + \log P\left(X\right)$

Rearrange things a bit, and apply the definition of KL divergence between $Q\left(z\vert X\right)$ and $P\left(z\right)$, we have:

$\log P\left(X\right) - \mathcal{D}\left[Q\left(z\vert X\right)\vert\vert P\left(z\vert X\right)\right] = E_{z\sim Q}\left[\log P\left(X\vert z\right)\right] - \mathcal{D}\left[Q\left(z\vert X\right) \vert\vert P\left(z\right)\right]$    (2)

If you forget everything, this formula is the thing you should remember. It is therefore important to understand what it means:

• The left-hand-side is exactly what we want to optimize, plus an error term. The smaller this error term is, the better we are in mazimizing $P\left(X\right)$. In other words, the left-hand-side is a lower-bound of what we want to optimize, hence the name variational (Bayesian).
• If $Q$ happens to be a differentiable function, the right-hand-side is something we can optimize with gradient descent (we will see how to do it later). Note that the right-hand-side happens to take the form of encoder and decoder, where $Q$ encodes $X$ into $z$, and then $P$ decodes $z$ to reconstruct $X$, hence the name “Autoencoder”. However, VAEs don’t really belong to the family of Denoising and Sparse Autoencoders, although there are indeed some connections.
• Note that $P\left(z\vert X\right)$ on the left hand side is something intractable. However, by maximizing the left hand side, we simultaneously minimize $\mathcal{D}\left[Q\left(z\vert X\right)\vert\vert P\left(z\vert X\right)\right]$, and therefore pull $Q\left(z\vert X\right)$ closer to $P\left(z\vert X\right)$. If we use a flexible model for $Q$, then we can use $Q$ as an approximation for $P\left(z\vert X\right)$. This is a nice side effect of the whole framework.

Actually the above maths existed way before VAEs. However the trick was to use a feedforward network for $Q$, which gave rise to VAEs several years ago.

Next time, we will see how to do that, and hopefully conclude this series. Then we can move on with something more interesting.

# [RL3b] Temporal Difference Learning – intuition

Như đã nói trong bài trước, ta sẽ tập trung vào Model-free RL, trong đó ta muốn học trực tiếp giá trị của các trạng thái $V(s)$ từ tập huấn luyện $^*$.

Cụ thể, tập huấn luyện sẽ gồm nhiều episodes, mỗi episode là một chuỗi $s_1 \overset{r_1}{\longrightarrow} s_2 \overset{r_2}{\longrightarrow} s_3 \longrightarrow ... \longrightarrow s_F$ mà agent thực hiện cho đến khi kết thúc cuộc đời của nó. Chẳng hạn nếu là chơi cờ thì tập huấn luyện sẽ gồm nhiều trận đấu, mỗi trận là một chuỗi các nước đi. Từ tập huấn luyện này, ta tìm cách ước lượng $V(s)$.

Tại sao lại ước lượng $V(s)$, khi trong bài trước ta nói rằng Model-free RL tìm cách ước lượng giá trị $Q(s, a)$? Nhắc lại rằng ta có thể tính $Q(s, a)$ từ $V(s)$, $R(s, a)$$T(s, a, s')$, như đã nói trong bài này, mà $R(s, a)$$T(s, a, s')$ có thể ước lượng một cách đơn giản từ tập huấn luyện (chẳng hạn dùng maximum likelihood), nên một khi có $V(s)$ ta hoàn toàn có thể tính $Q(s, a)$ nếu muốn. Tuy nhiên mục tiêu cuối cùng của RL vẫn là tìm ra policy tối ưu, chứ không phải tìm $V(s)$ hay $Q(s,a)$, mà từ $V(s)$ vẫn có thể dùng $\arg\max$ để tìm policy tối ưu, thành ra $V(s)$ hay $Q(s, a)$ cũng không còn quan trọng nữa.

Mọi chuyện sẽ dễ dàng hơn khi ta xem ví dụ sau. Cho một MDP với mô hình chuyển trạng thái T và hàm reward R như sau:

Mô hình này có 6 trạng thái. Khi chuyển từ $s_1$ sang $s_3$ thì reward nhận được là +1, nghĩa là $R(s_1, a) = +1$. Tương tự như vậy cho các trạng thái khác. $s_F$ là trạng thái cuối cùng, trò chơi kết thúc khi agent đi đến trạng thái này.