Sample Efficient Reinforcement Learning with REINFORCE

Authors: Junzi Zhang, Jongho Kim, Brendan O'Donoghue, Stephen Boyd

AAAI 2021, pp. 10887-10895 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | However, prior works have either required exact gradients or state-action visitation measure based mini-batch stochastic gradients with a diverging batch size, which limit their applicability in practical scenarios. In this paper, we consider classical policy gradient methods that compute an approximate gradient with a single trajectory or a fixed size mini-batch of trajectories under soft-max parametrization and log-barrier regularization, along with the widely-used REINFORCE gradient estimation procedure. By controlling the number of bad episodes and resorting to the classical doubling trick, we establish an anytime sub-linear high probability regret bound as well as almost sure global convergence of the average regret with an asymptotically sub-linear rate. These provide the first set of global convergence and sample efficiency results for the well-known REINFORCE algorithm and contribute to a better understanding of its performance in practice. (An illustrative sketch of the single-trajectory REINFORCE update appears after this table.)
Researcher Affiliation | Collaboration | Junzi Zhang (1), Jongho Kim (2), Brendan O'Donoghue (3), Stephen Boyd (2); (1) Institute for Computational & Mathematical Engineering, Stanford University, USA; (2) Department of Electrical Engineering, Stanford University, USA; (3) DeepMind, Google
Pseudocode | Yes | Algorithm 1: Policy Gradient Method with Single Trajectory Estimates ... Algorithm 2: Phased Policy Gradient Method ... Algorithm 3: Post-Process(θ, ϵ_pp). (A simplified sketch of the phased, doubling-trick schedule is given after this table.)
Open Source Code | No | The paper refers to a longer version of the paper online: "Sample efficient reinforcement learning with REINFORCE. Accessed March 23. [Online]. Available: https://stanford.edu/~boyd/papers/conv_reinforce.html." This URL points to the paper itself, not to source code for the methodology described.
Open Datasets | No | The paper is theoretical and discusses reinforcement learning settings; it does not describe experiments using specific datasets for training.
Dataset Splits | No | The paper is theoretical and does not describe empirical experiments, thus no dataset splits for validation are mentioned.
Hardware Specification | No | The paper is theoretical and does not describe any specific hardware used for running experiments.
Software Dependencies | No | The paper is theoretical and does not list specific software dependencies with version numbers.
Experiment Setup | No | The paper is theoretical and does not describe an empirical experimental setup with specific hyperparameters or training configurations.
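
For illustration only, the following is a minimal sketch of a single-trajectory REINFORCE update under soft-max parametrization with log-barrier regularization, the setting described in the abstract. It is not the paper's Algorithm 1: it assumes a tabular environment whose states and actions are integer indices, a hypothetical env.reset()/env.step(a) interface returning (next_state, reward, done), and placeholder hyperparameter values.

# Sketch (assumptions noted above): one REINFORCE ascent step on the
# log-barrier-regularized objective, using a single sampled trajectory.
import numpy as np

def softmax_policy(theta, s):
    """pi_theta(.|s) for a tabular soft-max parametrization theta[S, A]."""
    z = theta[s] - theta[s].max()            # stabilize the exponentials
    p = np.exp(z)
    return p / p.sum()

def reinforce_update(theta, env, gamma=0.99, horizon=200, step_size=0.1, lam=0.01):
    """One policy-gradient step from a single trajectory (illustrative values)."""
    n_states, n_actions = theta.shape
    # 1. Roll out one trajectory with the current policy.
    states, actions, rewards = [], [], []
    s = env.reset()
    for _ in range(horizon):
        a = np.random.choice(n_actions, p=softmax_policy(theta, s))
        s_next, r, done = env.step(a)        # assumed interface
        states.append(s); actions.append(a); rewards.append(r)
        s = s_next
        if done:
            break
    # 2. REINFORCE estimate: sum_t gamma^t * G_t * grad log pi(a_t|s_t),
    #    where G_t is the discounted return from time t onward.
    grad = np.zeros_like(theta)
    G = 0.0
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        pi_s = softmax_policy(theta, states[t])
        grad_log = -pi_s                     # grad of log softmax w.r.t. logits
        grad_log[actions[t]] += 1.0
        grad[states[t]] += (gamma ** t) * G * grad_log
    # 3. Exact gradient of the log-barrier regularizer
    #    (lam / (S*A)) * sum_{s,a} log pi_theta(a|s), available in closed form.
    for s in range(n_states):
        pi_s = softmax_policy(theta, s)
        grad[s] += lam / (n_states * n_actions) * (1.0 - n_actions * pi_s)
    # 4. Ascent step on the regularized objective.
    return theta + step_size * grad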
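
The phased method (Algorithm 2) builds on the classical doubling trick. The sketch below, which reuses reinforce_update and numpy from the previous snippet, only illustrates the doubling of phase lengths; the per-phase regularization schedule is an assumption, not the paper's, and the Post-Process step is omitted.

# Sketch of a phased schedule: phase k runs 2^k single-trajectory updates.
def phased_policy_gradient(theta, env, num_phases=10):
    for k in range(num_phases):
        phase_len = 2 ** k                   # doubling trick: phase lengths 1, 2, 4, ...
        lam_k = 1.0 / np.sqrt(phase_len)     # assumed per-phase regularization schedule
        for _ in range(phase_len):
            theta = reinforce_update(theta, env, lam=lam_k)
    return theta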