Sample Efficient Reinforcement Learning with REINFORCE

Authors: Junzi Zhang, Jongho Kim, Brendan O'Donoghue, Stephen Boyd

AAAI 2021, pp. 10887-10895 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | However, prior works have either required exact gradients or state-action visitation measure based mini-batch stochastic gradients with a diverging batch size, which limit their applicability in practical scenarios. In this paper, we consider classical policy gradient methods that compute an approximate gradient with a single trajectory or a fixed size mini-batch of trajectories under soft-max parametrization and log-barrier regularization, along with the widely-used REINFORCE gradient estimation procedure. By controlling the number of bad episodes and resorting to the classical doubling trick, we establish an anytime sub-linear high probability regret bound as well as almost sure global convergence of the average regret with an asymptotically sub-linear rate. These provide the first set of global convergence and sample efficiency results for the well-known REINFORCE algorithm and contribute to a better understanding of its performance in practice. (An illustrative sketch of the single-trajectory REINFORCE update appears after this table.)
Researcher Affiliation | Collaboration | Junzi Zhang (1), Jongho Kim (2), Brendan O'Donoghue (3), Stephen Boyd (2); (1) Institute for Computational & Mathematical Engineering, Stanford University, USA; (2) Department of Electrical Engineering, Stanford University, USA; (3) DeepMind, Google
Pseudocode | Yes | Algorithm 1: Policy Gradient Method with Single Trajectory Estimates ... Algorithm 2: Phased Policy Gradient Method ... Algorithm 3: Post-Process(θ, ϵ_pp). (A simplified sketch of the phased, doubling-trick schedule is given after this table.)
Open Source Code | No | The paper refers to a longer version of the paper online: "Sample efficient reinforcement learning with REINFORCE. Accessed March 23. [Online]. Available: https://stanford.edu/~boyd/papers/conv_reinforce.html." This URL points to the paper itself, not to source code for the methodology described.
Open Datasets | No | The paper is theoretical and discusses reinforcement learning settings; it does not describe experiments using specific datasets for training.
Dataset Splits | No | The paper is theoretical and does not describe empirical experiments, thus no dataset splits for validation are mentioned.
Hardware Specification | No | The paper is theoretical and does not describe any specific hardware used for running experiments.
Software Dependencies | No | The paper is theoretical and does not list specific software dependencies with version numbers.
Experiment Setup | No | The paper is theoretical and does not describe an empirical experimental setup with specific hyperparameters or training configurations.
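
For illustration only, the following is a minimal sketch of a single-trajectory REINFORCE update under soft-max parametrization with log-barrier regularization, the setting described in the abstract. It is not the paper's Algorithm 1: it assumes a tabular environment whose states and actions are integer indices, a hypothetical env.reset()/env.step(a) interface returning (next_state, reward, done), and placeholder hyperparameter values.

# Sketch (assumptions noted above): one REINFORCE ascent step on the
# log-barrier-regularized objective, using a single sampled trajectory.
import numpy as np

def softmax_policy(theta, s):
    """pi_theta(.|s) for a tabular soft-max parametrization theta[S, A]."""
    z = theta[s] - theta[s].max()            # stabilize the exponentials
    p = np.exp(z)
    return p / p.sum()

def reinforce_update(theta, env, gamma=0.99, horizon=200, step_size=0.1, lam=0.01):
    """One policy-gradient step from a single trajectory (illustrative values)."""
    n_states, n_actions = theta.shape
    # 1. Roll out one trajectory with the current policy.
    states, actions, rewards = [], [], []
    s = env.reset()
    for _ in range(horizon):
        a = np.random.choice(n_actions, p=softmax_policy(theta, s))
        s_next, r, done = env.step(a)        # assumed interface
        states.append(s); actions.append(a); rewards.append(r)
        s = s_next
        if done:
            break
    # 2. REINFORCE estimate: sum_t gamma^t * G_t * grad log pi(a_t|s_t),
    #    where G_t is the discounted return from time t onward.
    grad = np.zeros_like(theta)
    G = 0.0
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        pi_s = softmax_policy(theta, states[t])
        grad_log = -pi_s                     # grad of log softmax w.r.t. logits
        grad_log[actions[t]] += 1.0
        grad[states[t]] += (gamma ** t) * G * grad_log
    # 3. Exact gradient of the log-barrier regularizer
    #    (lam / (S*A)) * sum_{s,a} log pi_theta(a|s), available in closed form.
    for s in range(n_states):
        pi_s = softmax_policy(theta, s)
        grad[s] += lam / (n_states * n_actions) * (1.0 - n_actions * pi_s)
    # 4. Ascent step on the regularized objective.
    return theta + step_size * grad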
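
The phased method (Algorithm 2) builds on the classical doubling trick. The sketch below, which reuses reinforce_update and numpy from the previous snippet, only illustrates the doubling of phase lengths; the per-phase regularization schedule is an assumption, not the paper's, and the Post-Process step is omitted.

# Sketch of a phased schedule: phase k runs 2^k single-trajectory updates.
def phased_policy_gradient(theta, env, num_phases=10):
    for k in range(num_phases):
        phase_len = 2 ** k                   # doubling trick: phase lengths 1, 2, 4, ...
        lam_k = 1.0 / np.sqrt(phase_len)     # assumed per-phase regularization schedule
        for _ in range(phase_len):
            theta = reinforce_update(theta, env, lam=lam_k)
    return theta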