I thought I would write about some theory behind Reinforcement Learning, with eligibility traces, contraction mappings, POMDPs and so on, but then I realized that if I go down that rabbit hole, I would probably never finish this. So here is some practical stuff that people are actually using these days.

The two main approaches in model-free RL are *policy-based*, where we build a model of the policy function, and *value-based*, where we learn the value function. For many practical problems, the structure of the (unknown) policy function might be easier to learn than the (unknown) value function, in which case it makes more sense to use policy-based methods. Moreover, with policy-based methods, we get an explicit policy, which is the ultimate goal of the *learn-to-control* branch of Reinforcement Learning (the other type being *learn-to-predict*, more on that some other time). With *value-based* methods, like DQN, we need an additional inference step to extract the optimal policy from the learned value function.

The hybrid of *policy-based* and *value-based* methods is called *Actor-Critic*, which hopefully will be covered in another post.

One of the most straightforward approaches in *policy-based* RL is, unsurprisingly, *evolutionary algorithms*. In this approach, a population of policies is maintained and evolved over time. This has been shown to work pretty well, e.g. for the Tetris game. However, due to the randomness of the search, it is apparently only practical for problems where the number of parameters of the policy function is small.

A big part of policy-based methods, however, is based on *Policy Gradient*, where an unbiased estimate of the gradient of the expected future reward can be computed from samples. Roughly speaking, there is an exact formula for the gradient of the expected return with respect to the policy parameters, which we can then use to optimize the policy. Since Deep Learning people basically *worship* gradients (pun intended), this method suits them very well and became trendy about 2 years ago.
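To make this concrete, the standard REINFORCE form of the policy gradient (written here from memory, not quoted from this post) expresses the gradient of the expected return $J(\theta)$ as an expectation over trajectories $\tau$ sampled from the policy $\pi_\theta$:

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}
    \left[ \left( \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right) R(\tau) \right]
```

Since the right-hand side is an expectation under the current policy, we can estimate it by simply running the policy, recording the log-probability gradients of the actions taken, and weighting them by the observed return.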

So, what is it all about? It is based on a very simple trick called the *Likelihood Ratio* trick.
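As a toy illustration of the likelihood ratio (score function) trick, here is a minimal sketch, not from the post itself: we estimate the gradient of $J(\theta) = \mathbb{E}_{x \sim \mathcal{N}(\theta, 1)}[x^2]$ with respect to $\theta$, whose true value is $2\theta$, using only samples and the identity $\nabla_\theta \mathbb{E}[f(x)] = \mathbb{E}[f(x)\, \nabla_\theta \log p(x;\theta)]$.

```python
import numpy as np

def lr_gradient_estimate(theta, n_samples=200_000, seed=0):
    """Likelihood-ratio estimate of d/dtheta E_{x~N(theta,1)}[x^2].

    The trick: grad J = E[f(x) * grad_theta log p(x; theta)].
    For a unit-variance Gaussian, grad_theta log p(x; theta) = (x - theta).
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(loc=theta, scale=1.0, size=n_samples)  # sample x ~ p(x; theta)
    f = x ** 2                                            # "reward" of each sample
    score = x - theta                                     # score function of N(theta, 1)
    return np.mean(f * score)                             # Monte Carlo gradient estimate

theta = 1.0
print(lr_gradient_estimate(theta))  # close to the true gradient 2*theta = 2.0
```

Note that we never differentiate through the sampling process itself, only through the log-density of the samples we drew; this is exactly why the same trick works for RL, where the environment dynamics are unknown and non-differentiable.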

