[RL4a] Policy Optimization

I thought I would write about some theory behind Reinforcement Learning, with eligibility traces, contraction mapping, POMDP and so on, but then I realized if I go down that rabbit hole, I would probably never finish this. So here are some practical stuff that people are actually using these days.

Two main approaches in model-free RL are policy-based, where we build a model of the policy function, and value-based, where we learn the value function. For many practical problems, the structure of the (unknown) policy function might be easier to learn than the (unknown) value function, in which case it makes more sense to use policy-based methods. Moreover, with policy-based methods, we will have an explicit policy, which is the ultimate goal of Reinforcement learning to control (the other type being learn-to-predict, more on that some other time). With value-based methods, like DQN, we will need to do an additional inference step to get the optimal policy.

The hybrid of policy-based and value-based is called Actor-Critic methods, which hopefully will be mentioned in another post.

One of the most straightforward approach in policy-based RL is, unsurprisingly, evolutionary algorithms. In this approach, a population of policies is maintained and evolutionized over time. People show that this works pretty well, e.g. for the Tetris game. However, due to the randomness, this is apparently only applied to problems where the number of parameters of the policy function is small.

A big part of policy-based methods, however, is based on Policy Gradient, where an exact estimate of the gradient of the expected future reward can be computed. Roughly speaking, there is an exact formulation for the gradient of the policy, which we can then use to optimize the policy. Since Deep Learning people basically worship gradients (pun intended), this method suites very well and became trendy about 2 years ago.

So, what is it all about? It is based on a very simple trick called Likelihood Ratio.

Advertisements

Random thoughts on topological sort, backpropagation and stuff you learn in school

I learned the topological sort algorithm about 10 years ago in my undergrad, and for me it was pretty bizarre. I remember it was in one of the Introduction to Algorithms modules, where we were dealing with algorithms for graphs, like Binary search trees, RB trees and so on. Then suddenly, out of nowhere, we got to learn topological sort – an algorithm for sorting partially ordered sets.

Erhh.. what sets were that?

I mean I got RB tree because it is beautiful, and even though you probably won’t need to directly implement it at all in your whole career, learning the thing will improve your algorithmic mindset. But giving an order to a set of things which might not have a well-defined total order? Sorry but how useful it is gonna be?

Turns out the algorithm is neither fancy nor beautiful. Like most of other algorithms in textbooks, it is beautiful to some degree, but in a module where they taught you algorithms, it was difficult to pay any particular attention to topological sort alone.

Now in retrospect, I think topo sort is the single algorithm from those modules that I have been using most often. Okay, maybe I have been using quick sort more often, but I don’t need to re-implement it (except in interviews, which don’t count).

Imagine you want to run backpropagation in a neural network of multiple layers. It would be easy if your network is a simple chain of layers, but if you allow a layer to have multiple inputs (something like multi-modal nets), then how are you going to do it in a generic way?

Another example is when you have a pipeline of stuff, let it be a business workflow, a data flow, etc… where every operation might depend on other operations. The generic way to figure out an order to execute those operations is via topological sort.

If you have been using Tensorflow, then yes, they implement it for the graphs. In tensorflow, there are many different kinds of connection, an Op can be input to other Ops, or its execution simply needs to be happens before some other Ops (when some variables need to be initialized, for instance), but the general idea doesn’t change.

If there is any textbook algorithm that I am re-implementing again and again, I believe it is topological sort, and I couldn’t be more thankful to the professors who taught me this thing (fundamentally).

In fact, I realize it’s super difficult to fully anticipate the beauty and potential applications of stuff they taught you when you were in school. I am pretty sure most of them are useless (like I got to learn XML databases and object oriented databases somewhere in my education), but some will turn out to be unexpectedly valuable at some point in your career. This applies to many other things you learn outside of school as well. The trick is to know what to learn to maximize the profit in the long run.

Unfortunately I don’t know the answer to that. My strategy so far is to learn as much as I could, from as many people as I could. But we will see how far that will work (although I think it wouldn’t work for too long).

Before you ask, the next most popular textbook algorithm I have been using is state machines. When I was implementing my toy compiler around a decade ago, I didn’t imagine I would need to do it again and again for various kinds of parsers I had been doing.

And of course, Viterbi, I had some good time playing around with various aspects of it.

What about you? What are your favourite textbook algorithms?

Training neural nets with Distributed Tensorflow

The distributed version of Tensorflow has been released, but the documentation is not great. As with other open-source libraries like this, the most effective way is to look into the sample code. In case of tensorflow, the only sample available is the model trained on CIFAR. Technically it isn’t really a sample of the distributed version of tensorflow, it is just training on multiple GPUs. Moreoever, the CIFAR sample is quite obscured, and how to write a program to utilize a tensorflow cluster wasn’t clearly demonstrated.

So this is my attempt to demonstrate that. Check it out and leave me a comment if there is any issue.