Policy Gradient

Notes from the book "Reinforcement Learning: An Introduction" by Sutton and Barto

Policy Gradient

\begin{equation} V_\theta(s) \approx V^{\pi}(s) \end{equation}

\begin{equation} Q_\theta(s, a) \approx Q^{\pi}(s, a) \end{equation}

In value-based RL, the policy is derived from the learned value function, e.g. by acting $\epsilon$-greedily; policy-based RL works with the policy itself.

Directly parameterize the policy:

\begin{equation} \pi_{\theta} (s, a) = P\left[ a \mid s; \theta \right] \end{equation}

The goal is to find the policy $\pi_{\theta}$ with the highest value $V^{\pi_{\theta}}$.
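
As a concrete illustration of such a parameterization, below is a minimal sketch of a softmax policy with linear action preferences $\theta^\top \phi(s, a)$; the feature map `phi` and the action set are assumptions made for this example, not something fixed by the notes.

```python
import numpy as np

def softmax_policy(theta, phi, s, actions):
    """pi_theta(s, a) = P[a | s; theta] as a softmax over linear preferences theta^T phi(s, a)."""
    prefs = np.array([theta @ phi(s, a) for a in actions])
    p = np.exp(prefs - prefs.max())   # subtract max for numerical stability
    return p / p.sum()                # one probability per action in `actions`

# toy usage: 2 actions, 3-dimensional hand-made features
phi = lambda s, a: np.array([s, a, s * a], dtype=float)
theta = np.zeros(3)
print(softmax_policy(theta, phi, s=1.0, actions=[0, 1]))  # uniform when theta = 0
```

Adjusting $\theta$ changes the action preferences and therefore the distribution over actions, which is what the policy gradient methods below exploit.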

Pros:

  1. better convergence properties
  2. effective in high-dimensional or continuous action spaces
  3. can learn stochastic policies

Cons:

  1. typically converges to a local rather than a global optimum
  2. evaluating a policy is usually inefficient and has high variance

Gradient-Free Optimization

Finite Difference Policy Gradient
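
A minimal sketch of the finite-difference idea: perturb each component of $\theta$ in turn and estimate the resulting change in policy value from rollouts. The `evaluate_policy` function (returning an estimate of $V(\theta)$, e.g. the average return of $\pi_{\theta}$ over several episodes) is an assumption for illustration.

```python
import numpy as np

def finite_difference_gradient(theta, evaluate_policy, eps=1e-2):
    """Estimate grad_theta V(theta) one coordinate at a time.

    evaluate_policy(theta) is assumed to return an estimate of V(theta),
    e.g. the average return of pi_theta over several rollouts.
    """
    grad = np.zeros_like(theta)
    v = evaluate_policy(theta)
    for k in range(theta.size):
        theta_k = theta.copy()
        theta_k[k] += eps                                  # perturb the k-th parameter
        grad[k] = (evaluate_policy(theta_k) - v) / eps     # forward difference
    return grad
```

This needs `theta.size + 1` policy evaluations per gradient estimate, so it is simple but noisy and expensive when the parameter vector is large.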

Likelihood Ratio Policy Gradient

Policy value is

\begin{equation} V(\theta) = E_{\pi_{\theta}} \left[ \sum_{t = 0}^{T} R(s_t, a_t) \right] = \sum_{\tau} P(\tau; \theta) R(\tau) \end{equation}

where $\tau$ denotes a trajectory $(s_0, a_0, s_1, a_1, \dots)$ generated by following $\pi_{\theta}$ and $R(\tau)$ is its total reward.

Then, to find the optimal policy parameters $\theta$:

\begin{equation} \arg \max_{\theta} V(\theta) = \arg \max_{\theta} \sum_{\tau} P(\tau; \theta) R(\tau) \end{equation}
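
To fill in the step between this objective and the sample-based estimate below, the likelihood-ratio (score function) identity $\nabla_\theta P(\tau; \theta) = P(\tau; \theta) \nabla_\theta \log P(\tau; \theta)$ turns the gradient of the objective into an expectation:

\begin{equation} \nabla_\theta V(\theta) = \sum_{\tau} P(\tau; \theta) \nabla_\theta \log P(\tau; \theta) R(\tau) = E_{\pi_{\theta}} \left[ \nabla_\theta \log P(\tau; \theta) \, R(\tau) \right] \end{equation}

which can be estimated by sampling trajectories from $\pi_{\theta}$.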

More generally, for samples $x_i \sim P(x \mid \theta)$, each sample contributes the gradient estimate

\begin{equation} g_i = f(x_i) \nabla_\theta \log P(x_i \mid \theta) \end{equation}

where $f(x)$ measures how good the sample $x$ is.
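
As a concrete (and hedged) illustration of this estimator, here is a minimal sketch for a tabular softmax policy, where $x_i$ is an episode and $f$ is its total return. The environment interface (`env.reset()` returning a state index, `env.step(a)` returning `(next_state, reward, done)`) is an assumption made for this sketch.

```python
import numpy as np

def softmax_policy(theta, s):
    """pi_theta(. | s): softmax over the row of theta for state s (theta has shape (n_states, n_actions))."""
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def episode_gradient(theta, env):
    """One sample g_i = R(tau) * grad_theta log P(tau; theta) for a tabular softmax policy."""
    grad_log = np.zeros_like(theta)
    total_return = 0.0
    s, done = env.reset(), False          # assumed environment interface
    while not done:
        p = softmax_policy(theta, s)
        a = np.random.choice(len(p), p=p)
        # grad of log pi(a | s) for a softmax: one-hot(a) - p, in the row for state s
        # (the dynamics terms of log P(tau; theta) do not depend on theta and drop out)
        grad_log[s] -= p
        grad_log[s, a] += 1.0
        s, r, done = env.step(a)
        total_return += r
    return total_return * grad_log        # f(x_i) * grad_theta log P(x_i | theta)
```

Averaging `episode_gradient` over many sampled episodes gives a Monte Carlo estimate of $\nabla_\theta V(\theta)$, which can then be used for gradient ascent on $\theta$.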