[RL4a] Policy Optimization

I thought I would write about some theory behind Reinforcement Learning, with eligibility traces, contraction mapping, POMDP and so on, but then I realized if I go down that rabbit hole, I would probably never finish this. So here are some practical stuff that people are actually using these days.

Two main approaches in model-free RL are policy-based, where we build a model of the policy function, and value-based, where we learn the value function. For many practical problems, the structure of the (unknown) policy function might be easier to learn than the (unknown) value function, in which case it makes more sense to use policy-based methods. Moreover, with policy-based methods, we will have an explicit policy, which is the ultimate goal of Reinforcement learning to control (the other type being learn-to-predict, more on that some other time). With value-based methods, like DQN, we will need to do an additional inference step to get the optimal policy.

The hybrid of policy-based and value-based is called Actor-Critic methods, which hopefully will be mentioned in another post.

One of the most straightforward approach in policy-based RL is, unsurprisingly, evolutionary algorithms. In this approach, a population of policies is maintained and evolutionized over time. People show that this works pretty well, e.g. for the Tetris game. However, due to the randomness, this is apparently only applied to problems where the number of parameters of the policy function is small.

A big part of policy-based methods, however, is based on Policy Gradient, where an exact estimate of the gradient of the expected future reward can be computed. Roughly speaking, there is an exact formulation for the gradient of the policy, which we can then use to optimize the policy. Since Deep Learning people basically worship gradients (pun intended), this method suites very well and became trendy about 2 years ago.

So, what is it all about? It is based on a very simple trick called Likelihood Ratio.

Let’s say we have a sequence of states and actions $\tau = s_0, a_0, s_1, a_1, ..., s_T, a_T$ (called an episode or rollout). Let:

• $R\left(s_t, a_t\right)$ be the immediate reward at timestep after the agent be in state $s_t$ and performs action $a_t$.
• $R\left(\tau\right) = \sum_{t=0}^{T} R\left(s_t, a_t\right)$ be the total reward of the whole trajectory $\tau$
• $P\left(s_{t+1}\vert s_t, a_t\right)$ be the transition model (for presentation purpose only, we won’t need to know this quantity at the end of the day, because we are in model-free RL).
• $\pi_\theta\left(a_t\vert s_t\right)$ be the policy function, which is basically a distribution of actions given state. We will build a model with parameters $\theta$ and learn it from the observed sequences $\tau$‘s.
• $U\left(\theta\right) = \mathbb{E}\left[\sum_{t=0}^{T} R\left(s_t,a_t\right);\pi_\theta\right]$ be the expected future reward.
Using the definition of the expectation, we have $U\left(\theta\right) = \mathbb{E}\left[\sum_{t=0}^{T} R\left(s_t,a_t\right);\pi_\theta\right] = \sum_\tau P\left(\tau;\theta\right)R\left(\tau\right)$, where $P\left(\tau;\theta\right)$ is the probability of trajectory $\tau$ under the policy $\pi_\theta$.

Now, the whole purpose of learning is to estimate the parameters $\theta$ such that the expected reward, in other words:

$\displaystyle \max_\theta U\left(\theta\right) = \max_\theta \sum_\tau P\left(\tau;\theta\right)R\left(\tau\right)$

Just like any good mathematician, we would just compute the derivative of $U\left(\theta\right)$:

$\begin{array}{rl} \nabla_\theta U\left(\theta\right) & = \sum_\tau \nabla_\theta P\left(\tau;\theta\right)R\left(\tau\right) \\ \; & = \sum_\tau P\left(\tau;\theta\right)\frac{\nabla_\theta P\left(\tau;\theta\right)}{P\left(\tau;\theta\right)}R\left(\tau\right) \\ \; & = \sum_\tau P\left(\tau;\theta\right)\nabla_\theta\log P\left(\tau;\theta\right)R\left(\tau\right) \end{array}$

This might look too trivial, but this simple trick makes the right hand side becomes the expectation of the function $\log P\left(\tau;\theta\right)R\left(\tau\right)$ over all trajectories. Using the law of large number, we can approximate it with:

$\displaystyle \nabla_\theta U\left(\theta\right) \approx \frac{1}{m}\sum_{i=1}^m\nabla_\theta\log P\left(\tau^{(i)};\theta\right)R\left(\tau^{(i)}\right)$

During training, what we do is we take the episodes one at a time, compute the above estimation of the gradient (i.e. using $m=1$), and update $\theta$ using this gradient. Since we will be doing stochastic gradient descent anyway, the above estimation still works even with $m=1$.

However, this estimation is extremely noisy (say high-variance), and data-inefficient. It is high variance mainly because of the sum of total rewards $R\left(\tau\right)$ can varies depending on the immediate rewards the agent gets during the whole episode. It is data-inefficient because we can only update the parameters once per episode, therefore we need many episodes for this method to work.

We will see later how to address those issues. For now, let’s finish computing the only missing piece on the right hand side:

$\displaystyle\begin{array}{rl} \nabla_\theta\log P\left(\tau^{(i)};\theta\right) & = \displaystyle \nabla \log\left[\prod_{t=0}^{T-1}P\left(s_{t+1}^{(i)}\vert s_t^{(i)},a_t^{(i)}\right)\pi_\theta\left(a_t^{(i)}\vert s_t^{(i)}\right)\right] \\ \; & = \displaystyle \nabla_\theta\left[\sum_{t=0}^{T-1}\log P\left(s_{t+1}^{(i)}\vert s_t^{(i)},a_t^{(i)}\right) + \sum_{t=0}^{T-1}\log \pi_\theta\left(a_t^{(i)}\vert s_t^{(i)}\right)\right] \\ \; & = \displaystyle \nabla_\theta\sum_{t=0}^{T-1}\log \pi_\theta\left(a_t^{(i)}\vert s_t^{(i)}\right)\\ \; & = \displaystyle \sum_{t=0}^{T-1}\nabla_\theta\log \pi_\theta\left(a_t^{(i)}\vert s_t^{(i)}\right)\end{array}$

We could drop the first part of the sum because that is the transition model, and does not depends on $\theta$. Note that the result is pretty beautiful. Basically it is the sum of the derivatives of log-probability of the policy function over all timesteps in the episode. Plug this into the above function, we will have an estimate of the gradient to be used to update the paramters $\theta$.

In practice, people would use a deep neural net to model the policy function $\pi$. For continuous control problems, the output layer would be continuous. For discrete control problems, the output layer could be Gaussian (for binary controls) or softmax. The parameters of the nets will be updated with gradient ascent, because we want to maximize the objective function $U\left(\theta\right)$.

The annoying thing though, is there is no explicit loss function in this case. We can only compute the derivative, while the true objective function is still intractable. Therefore, monitoring the training process will be quite manual.

In tensorflow, since there is no explicit loss function, we would need to use tf.gradients() and apply_gradients() separately to train the policy net.

There are couple of approaches for reducing the variance of the gradient estimate. One simple way is to use a baseline, which is the expected rewards over each episode. This, combined with a discount rate for the discounted future rewards, is used in the well-known REINFORCE paper. Using a discount rate though, will make the estimate become bias. A rate of $\gamma =1$ gives unbias estimate, but $\gamma < 1$ gives bias estimate because, intuitively, it assumes the rewards of timesteps too far in the future have less effect than the near future.

More sophisticated approaches include Natural Gradients and Trust-Region Policy Optimization (TRPO), where a surrogate objective function is used instead of the $U\left(\theta\right)$ function (more of that in here and here). Deterministic Policy Gradients is also another approach.

Yet another way is to pre-train the policy net to mimic (human) expert behavior in a supervised setting, of course only when such a labelled dataset is available. This is one of the trick used in the AlphaGo paper.

However, the trend nowadays is to use some form of Actor-Critic, which, as David Silver said, is the most general form of RL.