Variational Policy Gradient Method for Reinforcement Learning with General Utilities

Authors: Junyu Zhang, Alec Koppel, Amrit Singh Bedi, Csaba Szepesvari, Mengdi Wang

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we consider policy optimization in Markov decision problems, where the objective is a general concave utility function of the state-action occupancy measure, which subsumes several of the aforementioned examples as special cases. Such generality invalidates the Bellman equation. As this means that dynamic programming no longer works, we focus on direct policy search. Analogously to the Policy Gradient Theorem [44] available for RL with cumulative rewards, we derive a new Variational Policy Gradient Theorem for RL with general utilities, which establishes that the parametrized policy gradient may be obtained as the solution of a stochastic saddle point problem involving the Fenchel dual of the utility function. We develop a variational Monte Carlo gradient estimation algorithm to compute the policy gradient based on sample paths. We prove that the variational policy gradient scheme converges globally to the optimal policy for the general objective, though the optimization problem is nonconvex. We also establish its rate of convergence of the order O(1/t) by exploiting the hidden convexity of the problem, and prove that it converges exponentially when the problem admits hidden strong convexity. Our analysis applies to the standard RL problem with cumulative rewards as a special case, in which case our result improves the available convergence rate. (An illustrative dual-form sketch of this saddle-point characterization follows the table.)
Researcher Affiliation | Collaboration | Junyu Zhang, Department of Electrical Engineering, Center for Statistics and Machine Learning, Princeton University, Princeton, NJ 08544, junyuz@princeton.edu; Alec Koppel, CISD, US Army Research Laboratory, Adelphi, MD 20783, alec.e.koppel.civ@mail.mil; Amrit Singh Bedi, CISD, US Army Research Laboratory, Adelphi, MD 20783, amrit0714@gmail.com; Csaba Szepesvári, Department of Computer Science, DeepMind / University of Alberta, szepesva@ualberta.ca; Mengdi Wang, Department of Electrical Engineering, Center for Statistics and Machine Learning, Princeton University / DeepMind, Princeton, NJ 08544, mengdiw@princeton.edu
Pseudocode | Yes | A Monte Carlo stochastic approximation scheme (Algorithm 1) is provided in Appendix B.1 of the supplementary material. (An illustrative occupancy-estimation sketch, distinct from Algorithm 1, follows the table.)
Open Source Code | No | The paper does not provide a concrete access link or an explicit statement about the availability of its source code.
Open Datasets | Yes | Now we shift to numerically validating our methods and theory on OpenAI Frozen Lake [11].
Dataset Splits | No | The paper does not explicitly provide specific training/test/validation dataset splits or percentages, only mentioning 'test trajectories'.
Hardware Specification | No | The paper does not provide any specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers needed to replicate the experiments.
Experiment Setup | Yes | PG Ascent for Maximal Entropy Exploration. Next, we consider maximum entropy exploration (7) using algorithm (16), with softmax parametrization. ... PG Ascent for Avoiding Obstacles. ... We consider imposing penalties to avoid costly states [cf. (6)] via a logarithmic barrier (20), and by applying variational PG ascent, we obtain an optimal policy whose resulting occupancy measure is depicted in Fig. 3(a) (top). ... for different penalty parameters β. (A toy max-entropy sketch follows the table.)
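
Illustrative dual form (a sketch, not a verbatim restatement of the paper's Variational Policy Gradient Theorem). Writing λ(θ) for the occupancy measure induced by the parametrized policy π_θ and assuming the utility F is concave and differentiable, one common convention for the Fenchel (concave) conjugate gives the representation below; the gradient identity is then just the chain rule evaluated at the dual variable z attaining the minimum, which is how the Fenchel dual of the utility enters the saddle-point characterization quoted in the table. The exact min-max statement, discounting convention, and regularization are given in the paper.

F^*(z) := \min_{\lambda}\,\{\langle z,\lambda\rangle - F(\lambda)\},
\qquad
F(\lambda(\theta)) = \min_{z}\,\{ V(\theta;z) - F^*(z) \},
\qquad
V(\theta;z) := \langle z, \lambda(\theta)\rangle,

\nabla_\theta F(\lambda(\theta))
  = \Big(\tfrac{\partial \lambda(\theta)}{\partial \theta}\Big)^{\top} \nabla F(\lambda(\theta))
  = \nabla_\theta V(\theta;z)\,\big|_{\,z = \nabla F(\lambda(\theta))}.

In words: the policy gradient of the general utility coincides with the ordinary policy gradient of a cumulative-reward problem whose "reward" z is supplied by the dual of F, which recovers the classical Policy Gradient Theorem when F is linear.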
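
As a companion, a minimal Monte Carlo sketch of the discounted occupancy estimate that a sample-path scheme of this kind relies on. The function name, the (state, action) rollout interface, and the truncation at the path length are assumptions made for illustration; the paper's Algorithm 1 (Appendix B.1) specifies the actual estimator used inside the variational gradient computation.

import numpy as np

def occupancy_estimate(sample_paths, num_states, num_actions, gamma=0.99):
    # Hypothetical helper (not the paper's Algorithm 1): Monte Carlo estimate of
    # the discounted state-action occupancy measure from truncated rollouts.
    # Each sample path is a list of (state, action) pairs generated by the policy.
    lam = np.zeros((num_states, num_actions))
    for path in sample_paths:
        for t, (s, a) in enumerate(path):
            lam[s, a] += gamma ** t          # accumulate the discount weight of each visit
    return lam / max(len(sample_paths), 1)   # average over sample paths

A general utility F (entropy, a log-barrier penalty, or an ordinary linear reward) is then evaluated or differentiated at such an estimate; multiplying by (1 - gamma) gives the normalized occupancy if that convention is used.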
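
Finally, a toy sketch of what a maximum-entropy-exploration objective under a softmax parametrization looks like, on a small synthetic MDP with the occupancy measure computed exactly and the ascent direction taken by finite differences. This is only meant to make the objective concrete; it is not the paper's variational PG update (16), and objective (7), the Frozen Lake environment, and the log-barrier penalty (20) are as described in the paper.

import numpy as np

def softmax(logits):
    # Row-wise softmax: one action distribution per state.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def state_occupancy(theta, P, mu, gamma):
    # Normalized discounted state occupancy of the softmax policy pi_theta.
    # P[s, a, s'] is the transition kernel, mu the initial state distribution.
    S = P.shape[0]
    pi = softmax(theta)                                  # (S, A) policy
    P_pi = np.einsum('sa,sat->st', pi, P)                # state-to-state kernel under pi
    return (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, mu)

def entropy_objective(theta, P, mu, gamma):
    d = state_occupancy(theta, P, mu, gamma)
    return -np.sum(d * np.log(d + 1e-12))                # entropy of the occupancy measure

def numerical_grad(f, theta, eps=1e-5):
    # Central-difference gradient; fine for a toy problem, not for real use.
    g = np.zeros_like(theta)
    for idx in np.ndindex(theta.shape):
        hi, lo = theta.copy(), theta.copy()
        hi[idx] += eps
        lo[idx] -= eps
        g[idx] = (f(hi) - f(lo)) / (2 * eps)
    return g

# Toy 3-state, 2-action MDP with a random transition kernel (illustrative only).
rng = np.random.default_rng(0)
S, A, gamma = 3, 2, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=-1, keepdims=True)
mu = np.ones(S) / S
theta = np.zeros((S, A))
for _ in range(200):                                     # plain gradient ascent on the entropy
    theta += 0.5 * numerical_grad(lambda t: entropy_objective(t, P, mu, gamma), theta)
print("entropy of occupancy:", entropy_objective(theta, P, mu, gamma))

Swapping the entropy for a logarithmic-barrier penalty on designated costly states would give the obstacle-avoidance variant in the same toy setting.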