Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Constrained Policy Improvement for Efficient Reinforcement Learning
Authors: Elad Sarafian, Aviv Tamar, Sarit Kraus
IJCAI 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate RBI in two tasks in the Atari Learning Environment: (1) learning from observations of multiple behavior policies and (2) iterative RL. Our results demonstrate the advantage of RBI over greedy policies and other constrained policy optimization algorithms both in learning from observations and in RL tasks. |
| Researcher Affiliation | Academia | 1 Bar-Ilan University, Israel 2 Technion, Israel |
| Pseudocode | Yes | Algorithm 1 Max-Reroute, Algorithm 2 RBI learner, Algorithm 3 RBI actor |
| Open Source Code | Yes | The appendix and the source code for the Atari experiments are found at github.com/eladsar/rbi/tree/rbi. |
| Open Datasets | Yes | To that end, we use a crowdsourced dataset of 4 Atari games (Space Invaders, Ms Pacman, Qbert, and Montezuma's Revenge) [Kurin et al., 2017] |
| Dataset Splits | No | The paper mentions using a dataset and training, but does not explicitly state the dataset splits for training, validation, and testing. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for experiments are mentioned in the paper. |
| Software Dependencies | No | No specific software dependencies with version numbers are provided in the paper. |
| Experiment Setup | Yes | For a fair comparison, we used a batch size of 128 and capped the learning process to 3.125M backward passes... we set (cmin, cmax) = (0.1, 2)... cgreedy = 0.1. The Q-function is learned with Qπ(a) = (1 − α)Qπ(a) + αr, where α is a learning rate, possibly decaying over time. |
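The incremental Q-update quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration of the formula Qπ(a) = (1 − α)Qπ(a) + αr only; the function name and the decaying learning-rate schedule are illustrative assumptions, not taken from the paper.

```python
# Sketch of the tabular Q-update quoted above:
#   Q(a) <- (1 - alpha) * Q(a) + alpha * r
# with alpha possibly decaying over time. Names here are hypothetical.

def q_update(q, a, r, alpha):
    """One incremental update of the action-value estimate for action a."""
    q[a] = (1 - alpha) * q[a] + alpha * r
    return q

q = {0: 0.0, 1: 0.0}
# Example stream of (action, reward) pairs with a 1/t decaying learning rate
# (an assumed schedule; the paper only says "possibly decaying over time").
for t, (a, r) in enumerate([(0, 1.0), (0, 1.0), (1, 0.5)], start=1):
    alpha = 1.0 / t
    q = q_update(q, a, r, alpha)
```

With a 1/t schedule this computes a running average of the observed rewards per action, which is the standard reason such updates are paired with decaying rates.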