# [RL4a] Policy Optimization

I thought I would write about some theory behind Reinforcement Learning, with eligibility traces, contraction mapping, POMDP and so on, but then I realized if I go down that rabbit hole, I would probably never finish this. So here are some practical stuff that people are actually using these days.

Two main approaches in model-free RL are policy-based, where we build a model of the policy function, and value-based, where we learn the value function. For many practical problems, the structure of the (unknown) policy function might be easier to learn than the (unknown) value function, in which case it makes more sense to use policy-based methods. Moreover, with policy-based methods, we will have an explicit policy, which is the ultimate goal of Reinforcement learning to control (the other type being learn-to-predict, more on that some other time). With value-based methods, like DQN, we will need to do an additional inference step to get the optimal policy.

The hybrid of policy-based and value-based is called Actor-Critic methods, which hopefully will be mentioned in another post.

One of the most straightforward approach in policy-based RL is, unsurprisingly, evolutionary algorithms. In this approach, a population of policies is maintained and evolutionized over time. People show that this works pretty well, e.g. for the Tetris game. However, due to the randomness, this is apparently only applied to problems where the number of parameters of the policy function is small.

A big part of policy-based methods, however, is based on Policy Gradient, where an exact estimate of the gradient of the expected future reward can be computed. Roughly speaking, there is an exact formulation for the gradient of the policy, which we can then use to optimize the policy. Since Deep Learning people basically worship gradients (pun intended), this method suites very well and became trendy about 2 years ago.

So, what is it all about? It is based on a very simple trick called Likelihood Ratio.

# On “The Sympathizer”

I made great progress and almost finish The Sympathizer. There is only one final knot to be untied, but I would write something about it now, otherwise I will be too lazy when I am done with it.

Usually I am highly reluctant to read trending books. Books, especially fictions, are written to sustain the test of time, hence if a book is good for today, it shouldn’t be too bad 10 years later, otherwise it isn’t worth it. Therefore, it usually isn’t worth the effort to read a book when you are not sure if it would last for 10 years (or 5 years, perhaps).

The Sympathizer is different though. Reading a fiction of one of your countrymen in a foreign language is a pretty weird experience, so weird that I simply couldn’t resist, especially when I was in short supply of good Vietnamese books.

Without spoiling the content, here are a few random comments on the book. It was very enjoyable and turned out to be a good investment.

The story told in the book was inspired by many events that are not too unfamiliar with many Vietnamese. Even the way the war was explained, although totally different from the way it was taught in Vietnam, is in fact, well-informed and thoughtful. Therefore, if you are a self-respected Vietnamese who cares to learn about history more than what being taught in schools, the story wouldn’t be too surprising.

The surprise for me though, was the writing style. Being a debut fiction, the book was remarkable. Readers are left with the feeling that the author puts effort in every single word appeared in the book. He would use bachelor to describe someone in celibacy, or use naïveté instead of naivety, perhaps just to make the narrator sounds a bit more French. In other scenario though, he would use tummy instead of stomach, just to highlight the intimacy of the plot being told. Sentences are often short, but he does not hesitate to write sentences that are one-page long, sometimes just to make a point. I haven’t read too many fictions in English, except a few from Charles Dickens, Jack London and Dan Brown (yea, I read Dan Brown too), so I might be bias, but this kind of dedication makes the book a pleasant read.

Many people praised the book for its satire and sense of humour, but those probably come from the brutal honesty of the unnamed narrator, speaking of whom, is quite a unique character.

The narrator is a hybrid, whose parent is a French priest and a Vietnamese maid. During the war, he found himself being an assistant to a General of the Army of South Vietnam, although he is actually a sleeper agent of the North. Like any other human being, he has his own weaknesses, in this case being his bastard status, which drives him nut every time it is mentioned by other people. Having studied in the US, he consumed the Western values and culture. The whole book is, therefore in some ways, his fight to find his true identity, the true home that he really belongs to. These existentialist questions are echoed by the fact that the book was opened with a quote from Friedrich Nietzsche.

Having such a complicated background, readers could easily expect him to be quite a man they could possibly have a beer with. He would make smart, provoking comments on every single chance, from the name of the USA to that of the USSR, from sex workers to how dating works, from wines to guns, from Saigon to Hollywood, from military to, you bet, politics, philosophy and arts. He could draw, or perhaps more precisely throw, deep philosophical thoughts on seemingly random events and stories. Having seen everything from both sides, perhaps multiple sides, his opinions are well-informed, brutal and amusing at the same time. He would take every chance to reflect and show the differences, or correspondences, between Oriental and Western world, as part of his identity crisis.

I still have couples of chapters left to work on, and therefore haven’t seen everything from the book yet. However, if there is anything to criticize, I would perhaps be concerned about how naive the narrator was when it comes to his loyalty with the North. Just in the same way he cracked the American politics and culture, as well as the war, it would be amazing if he spends a bit more effort to expose the Communist side. That would make the book a fair treatment on many sides involved in this bloody war.

Moreover, although the author was tactically smart about where to let the story speaks for itself and where to make comments, sometimes he made too much of a comment, making some part of the novel a bit heavy and overdone.

Nonetheless, The Sympathizer was a good book. For many Vietnamese who are not yet exposed to the minuscule details of the Vietnam war aftermaths, this is certainly a good read. For others, this is a refreshing book that probably will keep them thinking for a while after finishing it.

# Kalman filters (and how they relate to HMMs)

Kalman filters are insanely popular in many engineering fields, especially those involve sensors and motion tracking. Consider how to design a radar system to track military aircrafts (or warships, submarines, … for that matter), how to track people or vehicles in a video stream, how to predict location of a vehicle carrying a GPS sensor… In all these cases, some (advanced) variation of Kalman filter is probably what you would need.

Learning and teaching Kalman filters is therefore quite challenging, not only because of the mere complexity of the algorithms, but also because there are many variations of them.

With a Computer Science background, I encountered Kalman filters several years ago, but never managed to put them into the global context of the field. I had chances to look at them again recently, and rediscovered yet another way to present and explain Kalman filters. It made a lot of sense to me, and hopefully it does to you too.

Note that there are a lot of details missing from this post (if you are building a radar system to track military aircrafts, look somewhere else!). I was just unhappy to see many introductory material on Kalman filters are either too engineering or too simplified. I want something more Machine Learning-friendly, so this is my attempt.

Let’s say you want to track an enemy’s aircraft. All you have is a lame radar (bought from Russia probably) which, when oriented properly, will give you a 3-tuple of range, angle and angular velocity $[r \;\phi\;\dot{\phi}]^{T}$ of the aircraft. This vector is called the observation $\mathbf{z}_k$ (subscript $k$ because it depends on time). The actual position of the aircraft, though, is a vector in cartesian coordinates $\mathbf{x}_k = [x_1\;x_2\;x_3]^{T}$. Since it is an enemy’s aircraft, you can only observe $\mathbf{z}_k$, and you want to track the state vector $\mathbf{x}_k$ over time, every time you receive a measurement $\mathbf{z}_k$ from the radar.

Visualised as a Bayesian network, it looks like this:

With all the Markov properties hold, i.e. $\mathbf{x}_k$ only depends on $\mathbf{x}_{k-1}$ and $\mathbf{z}_k$ only depends on $\mathbf{x}_k$, does this look familiar?

# On GPU architecture and why it matters

I had a nice conversation recently around the architecture of CPUs versus that of GPUs. It was so good that I still remember the day after, so it is probably worth writing down.

Note that a lot of the following are still several levels of abstraction away from the hardware, and this is in no way a rigorous discussion of modern hardware design. Still, from the software development point of view, they are adequate for everything we need to know.

It started out of the difference in allocating transistors to different components on the chip of CPU and GPU. Roughly speaking, on CPUs, a lot of transistors are reserved for the cache (several levels of those), while on GPUs, most of transistors are used for the ALUs, and cache is not very well-developed. Moreover, a modern CPU merely has a few dozen cores, while GPUs might have thousands.

Why is that? The simple answer is because CPUs are MIMD, while GPUs are SIMD (although modern nVidia GPUs are closer to MIMD).

The long answer is CPUs are designed for the Von-neumann architecture, where data and instructions are stored on RAM and then fetched to the chip on demand. The bandwidth between RAM and CPU is limited (so-called data bus and instruction bus, whose bandwidth are typically ~100 bits on modern computers). For each clock cycle, only ~100bits of data can be transfer from RAM to the chip. If an instruction or data element needed by the CPU is not on the chip, the CPU might need to wait for a few cycles before the data is fetched from RAM. Therefore, a cache is highly needed, and the bigger the cache, the better. Modern CPUs have around 3 levels of cache, unsurprisingly named L1, L2, L3… with higher level cache sits closer to the processor. Data and instructions will first be fetched to the caches, and CPU can read from the cache with much lower latency (cache is expensive though, but that is another story). In short, in order to keep the CPU processors busy, cache is used to reduce the latency of reading from RAM.

GPUs are different. Designed for graphic processing, GPUs need to compute the same, often simple, arithmetic operations on a large amount of data points, because this is what happens in 3D rendering where there are thousands of vertices need to be processed in the shader (for those who are not familiar with computer graphics, that is to compute the color values of each vertex in the scene). Each vertex can be computed independently, therefore it makes sense to have thousands of cores running in parallel. For this to be scalable, all the cores should run the same computation, hence SIMD (otherwise it is a mess to schedule thousands of cores).

For CPUs, even with caches, there are still chances that the chip requires some data or commands that are not in the cache yet, and it would need to wait for a few cycles for the data to be read from RAM. This is obviously wasteful. Modern CPUs have pretty smart and complicated prediction on where to prefetch the data from RAM to minimize latency. For example, when it enters a FOR loop, it could fetch data around the arrays being accessed and the commands around the loops. Nonetheless, even with all those tricks, there are still chances for cache misses!

One simple way to keep the CPU cores busy is context switching. While the CPU is waiting for data from RAM, it can work on something else, and this eventually keeps the cores busy, while also provides the multi-tasking feature. We are not going to dive into context switching, but basically it is about to store the current stack, restore the stack trace, reload the registers, reset the instruction counter, etc…

Let’s talk about GPUs. A typical fragment of data that GPUs have to work with are in the order of megabytes in size, so it could easily take hundreds of cycles for the data to be fetched to the cores. The question then is how to keep the cores busy.

CPUs deal with this problem by context switching. GPUs don’t do that. The threads on GPUs are not switching, because it would be problematic to switch context at the scale of thousands of cores. For the sake of efficiency, there is little of locking mechanism between GPU cores, so context switching is difficult to implement efficiently.
– In fact, the GPUs don’t try to be too smart in this regards. It simply leaves the problem to be solved at the higher level, i.e. the application level.

Talking of applications, GPUs are designed for a very specific set of applications anyway, so can we do something smarter to keep the cores busy? In graphical rendering, the usual workflow is the cores read a big chunk of data from RAM, do computation on each element of the data and write the results back to RAM (sounds like Map Reduce? Actually it is not too far from that, we can talk about GPGPU algorithms in another post). For this to be efficient, both the reading and writing phases should be efficient. Writing is tricky, but reading can be made way faster with, unsurprisingly, a cache. However, the biggest cache system on GPUs are read-only, because writable cache is messy, especially when you have thousands of cores. Historically it is called texture cache, because it is where the graphical application would write the texture (typically a bitmap) for the cores to use to shade the vertices. The cores cant write to this cache because it would not need to, but it is writable from the CPU. When people move to GPGPU, the texture cache is normally used to store constants, where they can be read by multiple cores simultaneously with low latency.

To summarize, the whole point of the discussion was about to avoid the cores being idle because of memory latency. Cache is the answer to both CPUs and GPUs, but cache on GPUs are read-only to the cores due to their massive number of cores. When cache is certainly helpful, CPUs also do context switching to further increase core utilization. GPUs, to the best of my knowledge, don’t do that much. It is left to the developers to design their algorithms so that the cores are fed with enough computation to hide the memory latency (which, by the way, also includes the transfer from RAM to GPU memory via PCIExpress – way slower and hasn’t been discussed so far).

The proper way to optimize GPGPU algorithms is, therefore, to use the data transfer latency as the guide to optimize.

Nowadays, frameworks like tensorflow or torch hide all of these details, but at the price of being a bit inefficient. Tensorflow community is aware of this and trying their best, but still much left to be done.

# Variational Autoencoders 3: Training, Inference and comparison with other models

Variational Autoencoders 1: Overview
Variational Autoencoders 2: Maths
Variational Autoencoders 3: Training, Inference and comparison with other models

Recalling that the backbone of VAEs is the following equation:

$\log P\left(X\right) - \mathcal{D}\left[Q\left(z\vert X\right)\vert\vert P\left(z\vert X\right)\right] = E_{z\sim Q}\left[\log P\left(X\vert z\right)\right] - \mathcal{D}\left[Q\left(z\vert X\right) \vert\vert P\left(z\right)\right]$

In order to use gradient descent for the right hand side, we need a tractable way to compute it:

• The first part $E_{z\sim Q}\left[\log P\left(X\vert z\right)\right]$ is tricky, because that requires passing multiple samples of $z$ through $f$ in order to have a good approximation for the expectation (and this is expensive). However, we can just take one sample of $z$, then pass it through $f$ and use it as an estimation for $E_{z\sim Q}\left[\log P\left(X\vert z\right)\right]$ . Eventually we are doing stochastic gradient descent over different sample $X$ in the training set anyway.
• The second part $\mathcal{D}\left[Q\left(z\vert X\right) \vert\vert P\left(z\right)\right]$ is even more tricky. By design, we fix $P\left(z\right)$ to be the standard normal distribution $\mathcal{N}\left(0,I\right)$ (read part 1 to know why). Therefore, we need a way to parameterize $Q\left(z\vert X\right)$ so that the KL divergence is tractable.

Here comes perhaps the most important approximation of VAEs. Since $P\left(z\right)$ is standard Gaussian, it is convenient to have $Q\left(z\vert X\right)$ also Gaussian. One popular way to parameterize $Q$ is to make it also Gaussian with mean $\mu\left(X\right)$ and diagonal covariance $\sigma\left(X\right)I$, i.e. $Q\left(z\vert X\right) = \mathcal{N}\left(z;\mu\left(X\right), \sigma\left(X\right)I\right)$, where $\mu\left(X\right)$ and $\sigma\left(X\right)$ are two vectors computed by a neural network. This is the original formulation of VAEs in section 3 of this paper.

This parameterization is preferred because the KL divergence now becomes closed-form:

$\displaystyle \mathcal{D}\left[\mathcal{N}\left(\mu\left(X\right), \sigma\left(X\right)I\right)\vert\vert P\left(z\right)\right] = \frac{1}{2}\left[\left(\sigma\left(X\right)\right)^T\left(\sigma\left(X\right)\right) +\left(\mu\left(X\right)\right)^T\left(\mu\left(X\right)\right) - k - \log \det \left(\sigma\left(X\right)I\right) \right]$

Although this looks like magic, but it is quite natural if you apply the definition of KL divergence on two normal distributions. Doing so will teach you a bit of calculus.

So we have all the ingredients. We use a feedforward net to predict $\mu\left(X\right)$ and $\sigma\left(X\right)$ given an input sample $X$ draw from the training set. With those vectors, we can compute the KL divergence and $\log P\left(X\vert z\right)$, which, in term of optimization, will translate into something similar to $\Vert X - f\left(z\right)\Vert^2$.

It is worth to pause here for a moment and see what we just did. Basically we used a constrained Gaussian (with diagonal covariance matrix) to parameterize $Q$. Moreover, by using $\Vert X - f\left(z\right)\Vert^2$ for one of the training criteria, we implicitly assume $P\left(X\vert z\right)$ to be also Gaussian. So although the maths that lead to VAEs are generic and beautiful, at the end of the day, to make things tractable, we ended up using those severe approximations. Whether those approximations are good enough totally depend on practical applications.

There is an important detail though. Once we have $\mu\left(X\right)$ and $\sigma\left(X\right)$ from the encoder, we will need to sample $z$ from a Gaussian distribution parameterized by those vectors. $z$ is needed for the decoder to reconstruct $\hat{X}$, which will then be optimized to be as close to $X$ as possible via gradient descent. Unfortunately, the “sample” step is not differentiable, therefore we will need a trick call reparameterization, where we don’t sample $z$ directly from $\mathcal{N}\left(\mu\left(X\right), \sigma\left(X\right)\right)$, but first sample $z'$ from $\mathcal{N}\left(0, I\right)$, and then compute $z = \mu\left(X\right) + \mu\left(X\right)Iz'$. This will make the whole computation differentiable and we can apply gradient descent as usual.

The cool thing is during inference, you won’t need the encoder to compute $\mu\left(X\right)$ and $\sigma\left(X\right)$ at all! Remember that during training, we try to pull $Q$ to be close to $P\left(z\right)$ (which is standard normal), so during inference, we can just inject $\epsilon \sim \mathcal{N}\left(0, I\right)$ directly into the decoder and get a sample of $X$. This is how we can leverage the power of “generation” from VAEs.

There are various extensions to VAEs like Conditional VAEs and so on, but once you understand the basic, everything else is just nuts and bolts.

To sum up the series, this is the conceptual graph of VAEs during training, compared to some other models. Of course there are many details in those graphs that are left out, but you should get a rough idea about how they work.

In the case of VAEs, I added the additional cost term in blue to highlight it. The cost term for other models, except GANs, are the usual L2 norm $\Vert X - \hat{X}\Vert^2$.

GSN is an extension to Denoising Autoencoder with explicit hidden variables, however that requires to form a fairly complicated Markov Chain. We may have another post  for it.

With this diagram, hopefully you will see how lame GAN is. It is even simpler than the humble RBM. However, the simplicity of GANs makes it so powerful, while the complexity of VAE makes it quite an effort just to understand. Moreover, VAEs make quite a few severe approximation, which might explain why samples generated from VAEs are far less realistic than those from GANs.

That’s quite enough for now. Next time we will switch to another topic I’ve been looking into recently.

# Variational Autoencoders 2: Maths

Variational Autoencoders 1: Overview
Variational Autoencoders 2: Maths
Variational Autoencoders 3: Training, Inference and comparison with other models

Last time we saw the probability distribution of $X$ with a latent variable $z$ as follows:

$\displaystyle P(X) = \int P\left(X\vert z; \theta\right)P(z)dz$  (1)

and we said the key idea behind VAEs is to not sample $z$ from the whole distribution $P\left(z\right)$, but actually from a simpler distribution $Q\left(z\vert X\right)$. The reason is because most of $z$ will likely to give $P\left(X\vert z\right)$ close to zero, and therefore making little contribution to the estimation of $P\left(X\right)$. Now if we sample $z \sim Q\left(z\vert X\right)$, those values of $z$ will more likely to generate $X$ in the training set. Moreover, we hope that $Q$ will has less modes than $P\left(z\right)$, and therefore easier to sample from. The intuition of this is the locations of the modes of $Q\left(z\vert X\right)$ depends on $X$, and this flexibility will compensate the limitation of the fact that $Q\left(z\vert X\right)$ is simpler than $P\left(z\right)$.

But how $Q\left(z\vert X\right)$ can help with modelling $P\left(X\right)$? If $z$ is sampled from $Q$, then using $f$ we will get $E_{z \sim Q}P\left(X\vert z\right)$. We will then need to show the relationship of this quantity with $P\left(X\right)$, which is the actual quantity we want to estimate. The relationship between $E_{z \sim Q}P\left(X\vert z\right)$ and $P\left(X\right)$ is the backbone of VAEs.

We start with the KL divergence of $Q\left(z\vert X\right)$ and $P\left(z\vert X\right)$:

$\mathcal{D}\left[Q\left(z\vert X\right) \vert\vert P\left(z\vert X\right)\right] = E_{z\sim Q}\left[\log Q\left(z\vert X\right) - log P\left(z\vert X\right)\right]$

The unknown quantity in this equation is $P\left(z\vert X\right)$, but at least we can use Bayes rule for it:

$\mathcal{D}\left[Q\left(z\vert X\right) \vert\vert P\left(z\vert X\right)\right] = E_{z\sim Q}\left[\log Q\left(z\vert X\right) - log P\left(X\vert z\right) - \log P\left(z\right)\right] + \log P\left(X\right)$

Rearrange things a bit, and apply the definition of KL divergence between $Q\left(z\vert X\right)$ and $P\left(z\right)$, we have:

$\log P\left(X\right) - \mathcal{D}\left[Q\left(z\vert X\right)\vert\vert P\left(z\vert X\right)\right] = E_{z\sim Q}\left[\log P\left(X\vert z\right)\right] - \mathcal{D}\left[Q\left(z\vert X\right) \vert\vert P\left(z\right)\right]$    (2)

If you forget everything, this formula is the thing you should remember. It is therefore important to understand what it means:

• The left-hand-side is exactly what we want to optimize, plus an error term. The smaller this error term is, the better we are in mazimizing $P\left(X\right)$. In other words, the left-hand-side is a lower-bound of what we want to optimize, hence the name variational (Bayesian).
• If $Q$ happens to be a differentiable function, the right-hand-side is something we can optimize with gradient descent (we will see how to do it later). Note that the right-hand-side happens to take the form of encoder and decoder, where $Q$ encodes $X$ into $z$, and then $P$ decodes $z$ to reconstruct $X$, hence the name “Autoencoder”. However, VAEs don’t really belong to the family of Denoising and Sparse Autoencoders, although there are indeed some connections.
• Note that $P\left(z\vert X\right)$ on the left hand side is something intractable. However, by maximizing the left hand side, we simultaneously minimize $\mathcal{D}\left[Q\left(z\vert X\right)\vert\vert P\left(z\vert X\right)\right]$, and therefore pull $Q\left(z\vert X\right)$ closer to $P\left(z\vert X\right)$. If we use a flexible model for $Q$, then we can use $Q$ as an approximation for $P\left(z\vert X\right)$. This is a nice side effect of the whole framework.

Actually the above maths existed way before VAEs. However the trick was to use a feedforward network for $Q$, which gave rise to VAEs several years ago.

Next time, we will see how to do that, and hopefully conclude this series. Then we can move on with something more interesting.