Constrained Policy Improvement for Efficient Reinforcement Learning
Authors: Elad Sarafian, Aviv Tamar, Sarit Kraus
IJCAI 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate RBI in two tasks in the Atari Learning Environment: (1) learning from observations of multiple behavior policies and (2) iterative RL. Our results demonstrate the advantage of RBI over greedy policies and other constrained policy optimization algorithms both in learning from observations and in RL tasks. |
| Researcher Affiliation | Academia | ¹Bar-Ilan University, Israel; ²Technion, Israel |
| Pseudocode | Yes | Algorithm 1 Max-Reroute, Algorithm 2 RBI learner, Algorithm 3 RBI actor (a sketch of the Max-Reroute step follows the table) |
| Open Source Code | Yes | The appendix and the source code for the Atari experiments are found at github.com/eladsar/rbi/tree/rbi. |
| Open Datasets | Yes | To that end, we use a crowdsourced dataset of 4 Atari games (Space Invaders, Ms Pacman, Qbert, and Montezuma's Revenge) [Kurin et al., 2017] |
| Dataset Splits | No | The paper mentions using a dataset and training, but does not explicitly state the dataset splits for training, validation, and testing. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for experiments are mentioned in the paper. |
| Software Dependencies | No | No specific software dependencies with version numbers are provided in the paper. |
| Experiment Setup | Yes | For a fair comparison, we used a batch size of 128 and capped the learning process to 3.125M backward passes... we set (c_min, c_max) = (0.1, 2)... c_greedy = 0.1. The Q-function is learned with Qπ(a) = (1 − α)Qπ(a) + αr, where α is a learning rate, possibly decaying over time. (A worked illustration of this update follows the table.) |
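The Max-Reroute step named in the pseudocode row admits a compact greedy solution under the (c_min, c_max) constraints quoted in the experiment setup. The following is a minimal NumPy sketch, not the authors' implementation: it assumes the rerouting formulation in which the improved policy π is bounded by c_min·β(a) ≤ π(a) ≤ c_max·β(a) around the behavior policy β, with the freed probability mass shifted greedily toward the highest-valued actions. The function name `max_reroute` and its signature are illustrative.

```python
import numpy as np

def max_reroute(beta, q, c_min=0.1, c_max=2.0):
    """Sketch of a constrained greedy improvement step.

    Maximizes sum_a pi(a) * q(a) subject to
        c_min * beta(a) <= pi(a) <= c_max * beta(a),  sum_a pi(a) = 1.
    Feasibility requires c_min <= 1 <= c_max for a normalized beta.
    """
    beta = np.asarray(beta, dtype=float)
    q = np.asarray(q, dtype=float)
    pi = c_min * beta                       # start every action at its lower bound
    budget = 1.0 - pi.sum()                 # probability mass left to distribute
    for a in np.argsort(q)[::-1]:           # highest-valued actions first
        room = (c_max - c_min) * beta[a]    # slack up to the upper bound
        delta = min(room, budget)
        pi[a] += delta
        budget -= delta
        if budget <= 0:
            break
    return pi
```

For example, with a uniform β = [0.25, 0.25, 0.25, 0.25], q = [1.0, 0.5, 0.2, 0.1], and the quoted (0.1, 2) bounds, the sketch returns [0.5, 0.45, 0.025, 0.025]: the best action is capped at c_max·β(a) = 0.5, the worst actions are floored at c_min·β(a) = 0.025, and the remainder flows to the second-best action.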
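The Q-function rule quoted in the experiment setup is an exponential moving average toward the observed return. A one-line illustration, assuming a tabular Q and a fixed learning rate (both hypothetical simplifications; the paper allows α to decay over time):

```python
def q_update(Q, a, r, alpha=0.1):
    """Quoted update: Q(a) <- (1 - alpha) * Q(a) + alpha * r."""
    Q[a] = (1.0 - alpha) * Q[a] + alpha * r
    return Q
```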