SIAM Journal on Control and Optimization, Vol. 58, No. 6, pp. 3586-3612, 2020
GLOBAL CONVERGENCE OF POLICY GRADIENT METHODS TO (ALMOST) LOCALLY OPTIMAL POLICIES
Policy gradient (PG) methods have been one of the most essential ingredients of reinforcement learning, with applications in a variety of domains. In spite of their empirical success, a rigorous understanding of the global convergence of PG methods has been relatively lacking in the literature, especially for the infinite-horizon setting with discount factors. In this work, we close the gap by viewing PG methods from a nonconvex optimization perspective. In particular, we propose a new variant of PG methods for infinite-horizon problems that uses a random rollout horizon for the Monte Carlo estimation of the policy gradient. This method yields an unbiased estimate of the policy gradient with bounded variance, which enables the use of tools from nonconvex optimization to establish global convergence. Employing this perspective, we first point to an alternative way of recovering the convergence to stationary-point policies established in the literature. Motivated by recent advances in nonconvex optimization, we then modify the proposed PG method by introducing a periodically enlarged stepsize rule. More interestingly, this modified algorithm is shown to escape saddle points under mild assumptions on the reward functions and the policy parameterization of the reinforcement learning (RL) problem. Specifically, we connect the correlated negative curvature condition of [H. Daneshmand et al., Escaping saddles with stochastic gradients, in Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 2018, pp. 1155-1164] to the requirement that the reward be strictly positive or strictly negative. Under the additional assumption that all saddle points are strict, this result essentially establishes convergence to actual locally optimal policies of the underlying problem, and thus rigorously corroborates the overclaimed argument in the literature on the convergence of PG methods. In this respect, our findings justify the benefit of reward reshaping in terms of escaping saddle points, from a nonconvex optimization perspective.
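To make the random-horizon idea concrete, the following is a minimal sketch (not the paper's exact algorithm) of an unbiased Monte Carlo policy-gradient estimate obtained by drawing rollout horizons from a geometric distribution with parameter 1 - gamma, together with a toy update loop using a periodically enlarged stepsize. The gym-style environment interface, the callable names `policy` and `grad_log_pi`, and the specific stepsize values and enlargement period are illustrative assumptions, not quantities taken from the paper.

```python
import numpy as np


def random_horizon_pg_estimate(env, policy, grad_log_pi, theta, gamma, rng):
    """One stochastic policy-gradient estimate via geometrically drawn horizons.

    Sketch under assumptions: `env` exposes a gym-style reset()/step() interface
    with no terminal states (infinite-horizon MDP), `policy(theta, s)` samples an
    action, and `grad_log_pi(theta, s, a)` returns the score function of the policy.
    """
    # Draw T with P(T = t) = (1 - gamma) * gamma**t, t = 0, 1, ...; numpy's
    # geometric distribution is supported on {1, 2, ...}, hence the "- 1".
    T = rng.geometric(1.0 - gamma) - 1
    s = env.reset()
    for _ in range(T):
        s, _, _, _ = env.step(policy(theta, s))
    a = policy(theta, s)  # (s, a) is a draw from the discounted visitation measure

    # Second geometric horizon: the *undiscounted* truncated return is an unbiased
    # estimate of Q(s, a), since E[sum_{t <= T'} r_t] = sum_t gamma**t r_t whenever
    # P(T' >= t) = gamma**t.
    T_q = rng.geometric(1.0 - gamma) - 1
    q_hat, s_q, a_q = 0.0, s, a
    for _ in range(T_q + 1):
        s_q, r, _, _ = env.step(a_q)
        q_hat += r
        a_q = policy(theta, s_q)

    # Policy gradient theorem: grad J = 1/(1 - gamma) * E[grad log pi(a|s) * Q(s, a)].
    return grad_log_pi(theta, s, a) * q_hat / (1.0 - gamma)


def train(env, policy, grad_log_pi, theta, gamma=0.99, iters=10_000,
          eta_small=1e-3, eta_large=1e-1, period=500, seed=0):
    """Plain stochastic ascent with a periodically enlarged stepsize.

    The occasional large step mimics the idea that a bigger stochastic-gradient
    step, whose noise is correlated with negative curvature, can push the iterate
    away from saddle points; eta_small, eta_large, and period are placeholders.
    """
    rng = np.random.default_rng(seed)
    for k in range(iters):
        g = random_horizon_pg_estimate(env, policy, grad_log_pi, theta, gamma, rng)
        eta = eta_large if k % period == 0 else eta_small
        theta = theta + eta * g  # gradient *ascent* on the expected return
    return theta
```

The unbiasedness in this sketch rests on two standard facts: sampling a horizon T with P(T = t) = (1 - gamma) * gamma^t and stopping at s_T draws a state from the discounted visitation distribution, and summing undiscounted rewards up to an independent geometric horizon gives an unbiased estimate of the discounted Q-value; the resulting estimator is also bounded whenever the reward and the score function are bounded, consistent with the bounded-variance claim above.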