Policy Gradient



Notes from the book “Reinforcement Learning: An Introduction” by Sutton and Barto.


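In value-based RL, the value function or the action-value function is approximated with parameters $\theta$:

\begin{equation} V_\theta(s) \approx V^{\pi}(s) \end{equation}

\begin{equation} Q_\theta(s, a) \approx Q^{\pi}(s, a) \end{equation}

The policy is then generated implicitly from the value function, e.g. using $\epsilon$-greedy.

Policy-based RL instead parameterizes the policy directly: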

\begin{equation} \pi_{\theta} (s, a) = P\left[ a \mid s; \theta \right] \end{equation}
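One common way to realize such a parameterized policy for discrete actions is a softmax over linear action preferences. The sketch below is only an illustration under that assumption: the feature vectors `features[a]` (standing for some $\phi(s, a)$) and the function name `softmax_policy` are hypothetical and do not come from the notes.

```python
import math

def softmax_policy(theta, features):
    """Return the action probabilities pi_theta(s, .) for one state.

    theta    -- weight vector (the policy parameters)
    features -- features[a] is an assumed feature vector phi(s, a) for action a
    """
    # Linear preference theta . phi(s, a) for every action
    prefs = [sum(t * x for t, x in zip(theta, phi)) for phi in features]
    # Subtract the maximum before exponentiating for numerical stability
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

# Three actions described by two features each (made-up numbers)
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
uniform = softmax_policy([0.0, 0.0], features)   # all preferences are equal
skewed = softmax_policy([1.0, 0.0], features)    # actions 0 and 2 are preferred
print(uniform, skewed)
```

With all weights at zero every action is equally likely; changing $\theta$ shifts probability mass toward actions with larger preferences, which is what makes the policy differentiable in $\theta$.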

The goal is to find a policy $\pi$ with the highest value function $V^{\pi}$.


Advantages of policy-based RL:

  1. better convergence properties
  2. effective in high-dimensional or continuous action spaces
  3. can learn stochastic policies


Disadvantages:

  1. typically converges to a local rather than a global optimum
  2. evaluating a policy is typically inefficient and has high variance

Gradient free optimization

Finite difference policy gradient
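The idea is to perturb each component of $\theta$ by a small amount and estimate the corresponding partial derivative of the policy value from the change in value. A minimal sketch, assuming a deterministic `value_fn` that stands in for an estimate of $V(\theta)$ (in practice this would come from noisy rollouts); the names `finite_difference_gradient`, `value_fn` and `eps` are illustrative, and central differences are used here as one possible variant:

```python
def finite_difference_gradient(value_fn, theta, eps=1e-4):
    """Approximate the gradient of value_fn at theta with central differences."""
    grad = []
    for k in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[k] += eps      # perturb only the k-th parameter upward
        down[k] -= eps    # and downward
        grad.append((value_fn(up) - value_fn(down)) / (2.0 * eps))
    return grad

# Toy stand-in for V(theta): a concave quadratic with maximum at (1, -0.5)
def toy_value(theta):
    return -(theta[0] - 1.0) ** 2 - 2.0 * (theta[1] + 0.5) ** 2

fd_grad = finite_difference_gradient(toy_value, [0.0, 0.0])
print(fd_grad)
```

This needs two value evaluations per parameter, so the cost grows linearly with the dimension of $\theta$, but it works even when the policy is not differentiable.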

Likelihood Ratio Policies

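The policy value is

\begin{equation} V(\theta) = E_{\pi_\theta} \left[ \sum_{t = 0}^{T} R(s_t, a_t) \right] = \sum_{\tau} P(\tau; \theta) R(\tau) \end{equation}

where $\tau = (s_0, a_0, \dots, s_T, a_T)$ is a trajectory, $P(\tau; \theta)$ is its probability under $\pi_\theta$, and $R(\tau)$ is its total reward.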

The optimal policy parameter $\theta$ is then found by solving:

\begin{equation} \arg \max_{\theta} V(\theta) = \arg \max_{\theta} \sum_{\tau} P(\tau; \theta) R(\tau) \end{equation}
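To maximize this by gradient ascent, the gradient can be rewritten with the identity $\nabla_\theta P = P \, \nabla_\theta \log P$. The following is a sketch of that step; the sample count $m$ and the sampled trajectories $\tau^{(i)}$ are notation introduced here for illustration:

\begin{equation} \nabla_\theta V(\theta) = \sum_{\tau} \nabla_\theta P(\tau; \theta) R(\tau) = \sum_{\tau} P(\tau; \theta) \nabla_\theta \log P(\tau; \theta) R(\tau) \end{equation}

Because the last sum is an expectation over trajectories, it can be approximated from $m$ trajectories sampled with $\pi_\theta$:

\begin{equation} \nabla_\theta V(\theta) \approx \frac{1}{m} \sum_{i=1}^{m} R(\tau^{(i)}) \nabla_\theta \log P(\tau^{(i)}; \theta) \end{equation}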

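More generally, for an objective $E_{x \sim P(x \mid \theta)} \left[ f(x) \right]$, each sample $x_i$ gives the likelihood ratio gradient estimate

\begin{equation} g_i = f(x_i) \nabla_\theta \log P(x_i \mid \theta) \end{equation}

where $f(x)$ measures how good the sample $x$ is, and $\nabla_\theta \log P(x_i \mid \theta)$ is the direction in parameter space that makes $x_i$ more likely. For policy gradients, $x$ is a trajectory $\tau$ and $f(x) = R(\tau)$.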
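The estimator $g_i$ can be checked numerically on a toy problem. The sketch below assumes, purely for illustration, that $P(x \mid \theta)$ is a Gaussian with mean $\theta$ and fixed standard deviation, so that $\nabla_\theta \log P(x \mid \theta) = (x - \theta) / \sigma^2$; the function `score_function_gradient` and the choice $f(x) = -(x - 3)^2$ are made up for this example.

```python
import random

def score_function_gradient(f, theta, sigma=1.0, n_samples=200_000, seed=0):
    """Estimate d/dtheta E[f(x)] for x ~ N(theta, sigma^2) by averaging g_i."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(theta, sigma)
        # Score of a Gaussian with respect to its mean (assumed distribution)
        score = (x - theta) / sigma ** 2
        total += f(x) * score          # g_i = f(x_i) * grad log P(x_i | theta)
    return total / n_samples

def f(x):
    # Toy "goodness" of a sample: highest at x = 3
    return -(x - 3.0) ** 2

estimate = score_function_gradient(f, theta=0.0)
# Analytic gradient: E[f] = -((theta - 3)^2 + sigma^2), so d/dtheta = -2 (theta - 3)
exact = -2.0 * (0.0 - 3.0)
print(estimate, exact)
```

Note that the estimate only needs samples of $x$ and the values $f(x_i)$; $f$ itself is never differentiated. Individual $g_i$ are very noisy, which is why many samples are averaged here.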