# [RL4a] Policy Optimization

I thought I would write about some theory behind Reinforcement Learning, with eligibility traces, contraction mapping, POMDP and so on, but then I realized if I go down that rabbit hole, I would probably never finish this. So here are some practical stuff that people are actually using these days.

Two main approaches in model-free RL are policy-based, where we build a model of the policy function, and value-based, where we learn the value function. For many practical problems, the structure of the (unknown) policy function might be easier to learn than the (unknown) value function, in which case it makes more sense to use policy-based methods. Moreover, with policy-based methods, we will have an explicit policy, which is the ultimate goal of Reinforcement learning to control (the other type being learn-to-predict, more on that some other time). With value-based methods, like DQN, we will need to do an additional inference step to get the optimal policy.

The hybrid of policy-based and value-based is called Actor-Critic methods, which hopefully will be mentioned in another post.

One of the most straightforward approach in policy-based RL is, unsurprisingly, evolutionary algorithms. In this approach, a population of policies is maintained and evolutionized over time. People show that this works pretty well, e.g. for the Tetris game. However, due to the randomness, this is apparently only applied to problems where the number of parameters of the policy function is small.

A big part of policy-based methods, however, is based on Policy Gradient, where an exact estimate of the gradient of the expected future reward can be computed. Roughly speaking, there is an exact formulation for the gradient of the policy, which we can then use to optimize the policy. Since Deep Learning people basically worship gradients (pun intended), this method suites very well and became trendy about 2 years ago.

So, what is it all about? It is based on a very simple trick called Likelihood Ratio.

This is the slides I used for a talk I did recently in our reading group. The slides, particularly the Attention part, was based on one of Quoc Le’s talks on the same topic. I couldn’t come up with any better visual than what he did.

It has been quite a while since the last time I look at this topic, unfortunately I never managed to fully anticipate its beauty. Seq2Seq is one of those simple-ideas-that-actually-work in Deep Learning, which opened up a whole lot of possibilities and enabled many interesting work in the field.

A friend of mine did Variational Inference for his PhD, and once he said Variational Inference is one of those mathematically-beautiful-but-don’t-work things in Machine Learning.

Indeed, there are stuff like Variational, Bayesian inference, Sum-Product Nets etc… that come with beautiful mathematical frameworks, but don’t really work at scale, and stuff like Convolutional nets, GANs, etc.. that are a bit slippery in their mathematical foundation, often empirically discovered, but work really well in practice.

So even though many people might not really like the idea of GANs, for example, but given this “empirical tradition” in Deep Learning literature, probably they are here to stay.

So I read the Learning to learn paper a while ago, and I was surprised that the Decoupled Neural Interfaces paper didn’t cite them. For me the ideas are pretty close, where you try to predict the gradient used in each step of gradient descent, instead of computing it by backpropagation. Taking into account that they are all from DeepMind, won’t it be nice to cite each other and increase the impact factors for both of them?

Nevertheless, I enjoyed the paper. The key idea is instead of doing a normal update $\theta_{t+1} = \theta_{t} - \alpha_t \nabla f\left(\theta_t\right)$, we do it as $\theta_{t+1} = \theta_{t} + g_t\left(\nabla f\left(\theta_t\right), \phi\right)$ where $g_t$ is some function parameterized by $\phi$.

Now one can use any function approximator for $g_t$ (called optimizer, to differentiate with $f\left(\theta\right)$ – the optimizee), but using RNNs has a particular interesting intuition as we hope that the RNNs can remember the gradient history and mimic the behavior of, for instance, momentum.

The convenient thing about this framework is that the objective function for training the optimizer is the expected weighted sum of the output of the optimizee $f\left(\theta\right)$. Apart from this main idea, everything else is nuts and bolts, which of course are equivalently important.

The first obstacle that they had to solve is how to deal with big models of perhaps millions of parameters. In such cases, $g_t$ has to input and output vector of millions of dimensions. Instead, the authors solved this problem very nicely by only working with one parameter at a time, i.e. the optimizer only takes as input one element of the gradient vector and output the update for that element. However, since the optimizer is a LSTM, the state of the gradient coordinates are maintained separately. This also has a nice side effect that  it reduces the size of the optimizer, and you can potentially re-use the optimizer for different optimizees.

The next two nuts and bolts are not so obvious. To mimic the L2 gradient clipping trick, they used the so-called global averaging cell (GAC), where the outgoing activations of LSTM cells are averaged at each step across all coordinates. To mimic Hessian-based optimization algorithms, they wire the LSTM optimizer with an external memory unit, hoping that the optimizer will learn to store the second-order derivatives in the memory.

Although the experimental results look pretty promising, many people pose some doubts about the whole idea of learning to learn. I was in the panel discussion of Learning to learn at NIPS, and it wasn’t particularly fruitful (people were drinking sangria all the time). It will be interesting to see the follow-ups on this line of work, if there is any.

# From self-driving cars to model and dataset sizes

So I am done with teaching a vehicle to drive itself!

Errh, not quite there yet. I did it on a simulator, in an easy environment where there is only one lane, and no other traffic. This is very far from an actual self-driving vehicle.

Nevertheless, I had a lot of fun. It was actually way easier than I initially thought. It is simply a regression problem, where a CNN was trained to predict the steering angle. A vanila CNN with a significant amount of training data would do the job quite easily. Although it sounds simple, eventually this is how nVidia drives a car with their DAVE-2 system.

In practice, self-driving car is a bit more complicated. For example, nVidia’s paper didn’t show how they would handle traffic lights. I guess the Deep Learning way for that would be to collect a lot more data at crossroads, but I feel that would not be enough. At some point, you will need traditional engineering methods like sensor fusion to precisely locate the car on the road (more precise than what GPS provides), path finding for planning and all kinds of other signals.

However, every time I apply Deep Learning to a new domain, I learned something new. For this project, it is the following:

• On the vehicle, there are 3 cameras: one in the middle, one on the left and one on the right. Normally you just need to train the CNN to map the image collected from the center camera to the steering angle, and be done with it. However, it turns out that you can use the side cameras to teach the vehicle to recover from mistakes. For example, if the car is taking a left turn, then you can use the image from the left camera to teach it to do a softer left turn, and the image from the right camera do a harder left turn. Using this approach, during inference, you only need to run inference on the center image. How much softer and harder should be empirically determined.
You might think that you can read 3 images in the same time, and feed all three into the network, but that will require 3 images during inference, which might slow down the inference.
In fact the above technique is used by nVidia in their paper, and it could help the vehicle to recover from mistake, for example when it is close to the edge of the road.
Another data augmentation technique is to vertically flip the images, and reverse the steering angle. Using both techniques, you can augment the training set by a factor of 6.
• Inference time is crucial. In the beginning, I struggled a lot making the model to work. Then at some point I realize that it took around 0.1 second to evaluate the model, which might be too slow to drive a car. I then reduce the size of the model, until the point where it takes 0.01 seconds to evaluate, then the vehicle starts driving smoothly.

So how small (or big) your model should be? This obviously depends on the training set,  but is there any rule of thumb? A related question that some people also asked me is how big the training set should be? We keep saying Deep Learning needs big datasets, but how big is big, or how big should it be to expect some sensible performance? I hope the rest of this post could answer those questions.

How big the model should be?

Let’s say you have a training set of N samples. Now if I use a simple array of bits to store those samples, then I would need N bits to store N samples (the first bit is ON given the first sample, and so on). More strictly, I could say I only need $\log_2\left(N\right)$ bits to store N samples, because I could have N different configurations with that many bits.

In Deep Learning, we are well graduated from speaking in bits, but the same principle still holds. The easiest answer is you will need to construct your model so that it has N parameters to learn a training set of N samples.

That is still too lax though. Recall that a parameter in a neural net is a 32-bit floating point number, so a model of N parameters will have $32N$ bits in total. That’s why you would only need a model of $\frac{N}{32}$ parameters?

Not that strict. Although the parameters in neural nets are floating points, their values are often small, typically in the range of -0.3 to 0.3 (depending on how you normalize the data). This is due to various tricks we apply to the nets like initialization and small learning rate, in order to make optimization easier.

Since their values are restricted, probably only a few bits in each parameters are carrying useful information. How many is that? Typically people think it is about 8 or 16 bits. The proof for that is when you quantize the nets to low-precision (of 8 or 16 bits), then the performance of the net doesn’t decrease much.

So, as a typical (wild) rule of thumb, you should be able to overfit a training set of size N with a model of $\frac{N}{4}$ parameters. If you cannot overfit the training set, you are doing something really wrong with your initialization, learning rate and regularizer.

So you need to know how to count the number of parameters in a deep net. For fully connected layers, that simply is the size of the weight matrix and the biases. For convolutional layers, it is the size of the filter, multiplied by the number of filters. Most  modern Deep learning framework doesn’t use biases for convolutional layer, but in the past, people used to use a bias for each filter, so keep in mind that if you want to be very precise. The vanila RNN can be computed similarly.

LSTM is a bit more tricky, because there are a few variants of those: whether peephole is enabled, whether the forget bias is fixed, is it multi-dimensional LSTM, etc.. so the exact number might vary. However in general, the number of parameters of an LSTM layers of p units with inputs should be in the order of $5pq$.

Some time ago I used to write a python script to compute the exact number of parameters in a MDLSTM cell, but looking at it now took me some time to understand it.

I hope this points out that the key advantage of Deep Learning, compared to traditional method, is we can engineer the model as big as we want, sometimes depending on the dataset. This is not easily doable with other models like SVM and the like.

How big is the training set?

Using a similar reasoning, you could also answer this pretty easily.

Assume that your input is a N-dimensional vector, then the maximum number of configuration in that space is $2^{32N}$, which is enormous (sorry for using the word, you have Donald Trump to blame).

Of course that is the number of distinct configuration for all possible input. Your input domain is likely going to be a manifold in that high-dimensional space, meaning it will probably only take a tenth of that many degrees of freedom. So let’s say $2^{N/10}$.

Now you don’t need every sample in your input domain to train a deep model. As long as your input domain is relatively smooth, and the training set covers the most important modes in the data distribution, the model should be able to figure out the missing regions. So again, probably you only need a fifth of those, meaning around $2^{N/50}$ samples.

For instance in MNIST, the input is of $28 * 28 = 784$ dimensions, then you should have around $2^{784/50} \approx 32000$ samples. In fact there are 50000 samples in the MNIST training set.

In general, I think the rule of thumb would be around tens of thousands samples for a typical problem so that you can expect some optimistic results.

Note that those calculations are very coarse, and should only be used to give some intuition. They shouldn’t be used as an exact calculation as-it-is.

The problem is worse with time series and sequential data in general. Using the same calculation, you would end up with pretty big numbers because you need to multiply the numbers by the length of the sequence. I don’t think the same calculation can be applied for sequential data, because in sequences, the correlation between consecutive elements also play a big role in learning, so that might lax or limit the degree of freedom of the data. However, I used to work with small sequence dataset of size around tens of thousands samples. For difficult datasets, we might need half a million of samples.

The more you work on modelling, the more you learn about it. As always, I would love to hear your experience!

# On weight sharing in Deep Net

Today I learned how to explain weight sharing to a Deep Learning noob.

Going through the basic, fundamental materials over and over again often reveals interesting insights that you might never be aware of. This is such an example.

The whole point of many pattern recognition problems is about recognizing invariants. The same object that appears in different locations in an image (so-called shift invariant), the same word appears at different location in many sentences but bearing the same meaning, etc.. are all invariants, and you want your model to learn those invariants efficiently, without having to add more capacity to it.

In the Deep Learning way, it is achieved by weight sharing:

• In Convolutional nets, weight sharing is achieved by the convolutional filters.
• In text processing, weight sharing manifests itself as word/character embeddings.
• In sequential data, weight sharing between the consecutive steps in a sequence is what we call recurrent neural nets.

Weight sharing doesn’t only help the model to be more parameter-efficient, but also make learnings to be somewhat easier, because the model can reuse what it learned in different contexts.

Now, this might sound obvious to many of us, but probably it took the community quite a bit of time to come up with the current understanding.

People take for granted so many amazing ideas in Deep Learning nowadays, which by itself is pretty amazing.

# Deep Generative models – part 2: GSN, GAN and Ladder nets

Trong bài trước, ta đã nói vắn tắt về bài toán Generative Modeling trong Deep Learning. Bài này sẽ nói tiếp chuyện này một cách formal hơn, đồng thời điểm qua một số phương pháp đang “thịnh hành” trong cộng đồng gần đây. Lưu ý rằng Generative Modeling vẫn là một bài toán chưa được giải quyết trọn vẹn, thành ra có thể có vài cách tiếp cận khác không được điểm danh ở đây, và đa số các cách tiếp cận nói đến trong bài này đều vẫn còn đang ở trong tình trạng đang được nghiên cứu.

Bài toán ultimate trong học máy thống kê có lẽ là bài toán này: cho một tập mẫu $\left\{x_i \right\}_{i=1}^N$ được lấy từ phân phối xác suất $P\left(X\right)$ chưa biết. Xây dựng mô hình để “bắt chước” phân phối $P\left(X\right)$ này.

Bài toán này khó vì ta giả sử rằng ta không biết gì về $P\left(X\right)$, ngoại trừ một tập mẫu hữu hạn của nó. Hơn nữa, trong nhiều trường hợp, $P\left(X\right)$ có thể rất phức tạp, chẳng hạn như mô hình sinh ra tất cả các ảnh RGB chụp phong cảnh tự nhiên, hay là ảnh X-quang chụp phổi bị ung thư, mô hình phái sinh ra thơ Shakespeare, v.v…. Trong những trường hợp như vậy, mô hình hoá trực tiếp $P\left(X\right)$ có thể rất khó.

Mô hình mà mình nghĩ là “đơn giản” nhất trong Deep Learning để giải quyết bài toán này có lẽ là Sum-Product Networks (SPN). Ý tưởng chính của SPN là thiết kế mạng sao cho nó tractable sẵn, vì vậy huấn luyện SPN không cần phải quan tâm đến partition fuction, vì tính partition fuction trong SPN lúc nào cũng tractable (by construction). Mặc dù ý tưởng này rất tốt, nhưng một vài kết quả thực nghiệm cho thấy chính vì ràng buộc này mà có thể lớp hàm SPN có thể xấp xỉ không đủ lớn để mô hình hoá các phân phối phức tạp trong thực tế.

Ngoài SPN, một mô hình khác có vẻ hứa hẹn sẽ giải quyết được vấn đề này là Autoencoders, nhất là thể loại Denoising Autoencoder (DAE). DAE là mô hình rất đơn giản với chỉ một hidden layer. Đầu tiên ta chọn một phân phối nhiễu $\mathcal{C}\left(\tilde{X}\vert X\right)$. Với mỗi mẫu “sạch” $X$ từ phân phối $P\left(X\right)$, ta áp dụng mô hình nhiễu, chẳng hạn nếu X là ảnh thì ta có thể thêm nhiễu Gaussian vào để tạo thành bản “lỗi” $\tilde{X}$. Sau đó ta đưa $\tilde{X}$ vào cho DAE và huấn luyện để nó làm sạch nhiễu cho ta mẫu $X$ ban đầu.

Nói theo ngôn ngữ function approximation thì DAE thực chất được huấn luyện để xấp xỉ phân phối có điều kiện $P\left(X \vert \tilde{X}\right)$ (gọi là reconstruction function, vì đây là hàm sẽ cho ta phiên bản sạch $X$ từ bản nhiễu $\tilde{X}$). Người ta cho rằng xấp xỉ $P\left(X \vert \tilde{X}\right)$ dễ hơn nhiều so với xấp xỉ $P\left(X\right)$, vì về cơ bản $P\left(X \vert \tilde{X}\right)$ sẽ gồm ít mode hơn so với $P\left(X\right)$. (more…)

# ExcelRNN, ExcelLSTM and ExcelGRU

So I heard you are fancy trying out ExcelNet? Well in addition to convolutional nets, you might as well run recurrent nets on it!

I went on and made a Google Sheets document for running several kinds of recurrent nets. I used it to debug some of the code I wrote recently, but then I think it might be fun to share it out. The current version only has 2 dimensions in the input, but adding new dimensions should be easy (update the weights and formulas, and you get the idea…).

Since I still might need it every now and then, I made it read-only. But feel free to make a copy and play with it yourself.

Give it a try here and let me know what you think.