Today I watched lecture 5 from Berkeley’s deep rl course, which covered actor critic algorithms.
The basic idea is that, building off of the policy gradient ideas, we can find ways to better improve the algorithm. In particular, if we consider the policy as an “actor”, we can create a “critic” that helps us determine the relative advantages of each actor. This helps us reduce the variance, which in turn helps train the policy gradient algorithm.
This lecture mainly covered various ways of implementing a critic network including, but not limited to, training methods and the inclusion of discount factors. It also covered more generalized baselines, including n-step returns and GAE.
This mathematical overview was very useful in implementing hw2 of this course, which implemented policy gradient from sctach. The math is pretty cool, and if anyone wants to check it out, my full notes are below: