Policy Gradient
Notes from the book “Reinforcement Learning: An Introduction” by Sutton and Barto.
Policy Gradient
So far, the value or action-value function was approximated with parameters $\theta$:
\begin{equation} V_\theta(s) \approx V^{\pi}(s) \end{equation}
\begin{equation} Q_\theta(s, a) \approx Q^{\pi}(s, a) \end{equation}
In value-based RL, the policy is generated from the value function, e.g. by acting $\epsilon$-greedily with respect to $Q_\theta$.
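For example, a minimal sketch of acting $\epsilon$-greedily with respect to an approximate $Q_\theta$, assuming the action values for the current state are available as an array:

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon=0.1, rng=None):
    """Pick an action given approximate action values for the current state,
    i.e. q_values[a] ~= Q_theta(s, a)."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < epsilon:                  # explore uniformly
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))             # otherwise act greedily
```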
In policy-based RL, we instead directly parameterize the policy:
\begin{equation} \pi_{\theta} (s, a) = P\left[ a \mid s; \theta \right] \end{equation}
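As a concrete (illustrative) example, a common choice is a softmax policy over linear scores $\theta^\top \phi(s, a)$; the feature layout below is an assumption for the sketch, not something fixed by the text:

```python
import numpy as np

def softmax_policy(theta, features):
    """pi_theta(a | s): softmax over linear scores theta . phi(s, a).

    features has shape (num_actions, d), one feature vector phi(s, a)
    per action; returns a probability for each action."""
    scores = features @ theta
    scores = scores - scores.max()        # subtract max for numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

def sample_action(theta, features, rng):
    """Draw an action from pi_theta(. | s)."""
    return int(rng.choice(len(features), p=softmax_policy(theta, features)))
```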
The goal is to find a policy $\pi$ with the highest value function $V^{\pi}$.
Pros:
- better convergence properties
- effective in high-dimensional or continuous action spaces
- can learn stochastic policies
Cons:
- typically converges to a local rather than a global optimum
- evaluating a policy is typically inefficient and high variance
Gradient-free optimization
Finite difference policy gradient
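The finite-difference approach estimates the gradient by perturbing each component of $\theta$ in turn and comparing the resulting policy values. A rough sketch, where `evaluate_policy` is a placeholder for a (typically noisy) estimate of $V(\theta)$, e.g. the average return of a few rollouts under $\pi_\theta$:

```python
import numpy as np

def finite_difference_gradient(theta, evaluate_policy, eps=1e-2):
    """Estimate grad_theta V(theta) one coordinate at a time:
    dV/dtheta_k ~= (V(theta + eps * u_k) - V(theta)) / eps."""
    grad = np.zeros_like(theta)
    v = evaluate_policy(theta)                  # baseline estimate of V(theta)
    for k in range(len(theta)):
        perturbed = theta.copy()
        perturbed[k] += eps                     # perturb the k-th coordinate
        grad[k] = (evaluate_policy(perturbed) - v) / eps
    return grad
```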
Likelihood Ratio Policy Gradient
The policy value is \begin{equation} V(\theta) = E_{\pi_\theta} \left[ \sum_{t = 0}^{T} R(s_t, a_t) \right] = \sum_{\tau} P(\tau; \theta) R(\tau) \end{equation}
where $\tau = (s_0, a_0, \ldots, s_T, a_T)$ is a trajectory obtained by following $\pi_\theta$.
Then the optimal policy parameters are
\begin{equation} \theta^{*} = \arg \max_{\theta} V(\theta) = \arg \max_{\theta} \sum_{\tau} P(\tau; \theta) R(\tau) \end{equation}
To ascend $V(\theta)$ we need its gradient. Using the likelihood ratio trick, $\nabla_\theta P(x \mid \theta) = P(x \mid \theta) \nabla_\theta \log P(x \mid \theta)$, the gradient of an expectation can be estimated from samples $x_i \sim P(x \mid \theta)$, each contributing
\begin{equation} g_i = f(x_i) \nabla_\theta \log P(x_i \mid \theta) \end{equation}
where $f(x)$ measures how good the sample $x$ is; the gradient estimate is the average of the $g_i$. For the policy value above, $x$ is a trajectory $\tau$ and $f(\tau) = R(\tau)$.
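Putting the pieces together, here is a minimal sketch of this estimator for the linear softmax policy sketched earlier; the trajectory format (a list of `(features, action, reward)` steps) and helper names are assumptions for illustration:

```python
import numpy as np

def grad_log_softmax_policy(theta, features, action):
    """grad_theta log pi_theta(action | s) for a linear softmax policy:
    phi(s, action) - sum_b pi_theta(b | s) phi(s, b)."""
    scores = features @ theta
    probs = np.exp(scores - scores.max())
    probs = probs / probs.sum()
    return features[action] - probs @ features

def likelihood_ratio_gradient(theta, trajectories):
    """g ~= (1/m) sum_i R(tau_i) grad_theta log P(tau_i; theta).

    trajectories: list of episodes, each a list of (features, action, reward)
    steps. The dynamics terms in log P(tau; theta) do not depend on theta,
    so only the per-step grad log pi_theta terms remain."""
    grad = np.zeros_like(theta)
    for traj in trajectories:
        ret = sum(r for _, _, r in traj)                                   # R(tau)
        score = sum(grad_log_softmax_policy(theta, f, a) for f, a, _ in traj)
        grad = grad + ret * score
    return grad / len(trajectories)
```

The returned estimate can then be used for gradient ascent on $\theta$, e.g. `theta += alpha * likelihood_ratio_gradient(theta, trajectories)` with a small step size `alpha` (an illustrative update, not prescribed by the text).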