Sample Efficient Reinforcement Learning with REINFORCE
Authors: Junzi Zhang, Jongho Kim, Brendan O'Donoghue, Stephen Boyd
AAAI 2021, pp. 10887-10895 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | However, prior works have either required exact gradients or state-action visitation measure based mini-batch stochastic gradients with a diverging batch size, which limit their applicability in practical scenarios. In this paper, we consider classical policy gradient methods that compute an approximate gradient with a single trajectory or a fixed size mini-batch of trajectories under soft-max parametrization and log-barrier regularization, along with the widely-used REINFORCE gradient estimation procedure. By controlling the number of bad episodes and resorting to the classical doubling trick, we establish an anytime sub-linear high probability regret bound as well as almost sure global convergence of the average regret with an asymptotically sub-linear rate. These provide the first set of global convergence and sample efficiency results for the well-known REINFORCE algorithm and contribute to a better understanding of its performance in practice. |
| Researcher Affiliation | Collaboration | Junzi Zhang1, Jongho Kim2, Brendan O'Donoghue3, Stephen Boyd2 1 Institute for Computational & Mathematical Engineering, Stanford University, USA 2 Department of Electrical Engineering, Stanford University, USA 3 DeepMind, Google |
| Pseudocode | Yes | Algorithm 1 Policy Gradient Method with Single Trajectory Estimates ... Algorithm 2 Phased Policy Gradient Method ... Algorithm 3 Post Process(θ, ϵpp). (A minimal illustrative sketch of the single-trajectory update is given after the table.) |
| Open Source Code | No | The paper refers to a longer version of the paper online: "Sample efficient reinforcement learning with REINFORCE. Accessed March 23. [Online]. Available: https://stanford.edu/~boyd/papers/conv_reinforce.html." This URL points to the paper itself, not to source code for the methodology described. |
| Open Datasets | No | The paper is theoretical and discusses reinforcement learning settings; it does not describe experiments using specific datasets for training. |
| Dataset Splits | No | The paper is theoretical and does not describe empirical experiments, thus no dataset splits for validation are mentioned. |
| Hardware Specification | No | The paper is theoretical and does not describe any specific hardware used for running experiments. |
| Software Dependencies | No | The paper is theoretical and does not list specific software dependencies with version numbers. |
| Experiment Setup | No | The paper is theoretical and does not describe an empirical experimental setup with specific hyperparameters or training configurations. |
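
Since the paper is theoretical and releases no code, the following is a minimal sketch (not the authors' implementation) of the single-trajectory REINFORCE update that the abstract and Algorithm 1 describe: a tabular soft-max policy, a log-barrier regularizer, and one gradient ascent step per sampled trajectory. The environment interface (`reset`/`step`), the step size `eta`, and the regularization weight `lam` are assumptions introduced for illustration only.

```python
# Minimal sketch of REINFORCE with a tabular soft-max policy and log-barrier
# regularization, using a single-trajectory gradient estimate. This is an
# illustrative assumption-laden example, not the paper's code. Episodic
# (terminating) rollouts are assumed.
import numpy as np


def softmax_policy(theta, s):
    """Soft-max parametrization: pi_theta(a|s) proportional to exp(theta[s, a])."""
    z = theta[s] - theta[s].max()          # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()


def reinforce_single_trajectory(env, theta, gamma, lam, eta):
    """One REINFORCE update from a single sampled trajectory.

    gamma: discount factor; lam: log-barrier weight; eta: step size
    (all hypothetical hyperparameters for this sketch).
    """
    n_states, n_actions = theta.shape
    states, actions, rewards = [], [], []

    # Roll out one trajectory under the current policy (env API is assumed).
    s, done = env.reset(), False
    while not done:
        probs = softmax_policy(theta, s)
        a = np.random.choice(n_actions, p=probs)
        s_next, r, done = env.step(a)
        states.append(s)
        actions.append(a)
        rewards.append(r)
        s = s_next

    # Score-function (REINFORCE) gradient estimate of the discounted return.
    grad = np.zeros_like(theta)
    G = 0.0
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G                 # return from step t onward
        probs = softmax_policy(theta, states[t])
        g_log = -probs                             # grad of log pi(a_t|s_t)
        g_log[actions[t]] += 1.0                   # w.r.t. theta[s_t, :]
        grad[states[t]] += (gamma ** t) * G * g_log

    # Log-barrier regularizer: (lam / (|S||A|)) * sum_{s,a} log pi_theta(a|s);
    # its gradient w.r.t. theta[s, a] is 1 - |A| * pi_theta(a|s), scaled.
    for s in range(n_states):
        probs = softmax_policy(theta, s)
        grad[s] += lam / (n_states * n_actions) * (1.0 - n_actions * probs)

    return theta + eta * grad                      # one gradient ascent step
```

The score-function term and the regularizer gradient follow directly from the soft-max parametrization; the phased schedule, doubling trick, and post-processing of Algorithms 2 and 3 are omitted from this sketch.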