Constrained Policy Improvement for Efficient Reinforcement Learning

Authors: Elad Sarafian, Aviv Tamar, Sarit Kraus

IJCAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate RBI in two tasks in the Atari Learning Environment: (1) learning from observations of multiple behavior policies and (2) iterative RL. Our results demonstrate the advantage of RBI over greedy policies and other constrained policy optimization algorithms both in learning from observations and in RL tasks.
Researcher Affiliation | Academia | 1 Bar-Ilan University, Israel; 2 Technion, Israel
Pseudocode | Yes | Algorithm 1: Max-Reroute; Algorithm 2: RBI learner; Algorithm 3: RBI actor
Open Source Code | Yes | The appendix and the source code for the Atari experiments are found at github.com/eladsar/rbi/tree/rbi.
Open Datasets | Yes | To that end, we use a crowdsourced dataset of 4 Atari games (Space Invaders, Ms Pacman, Qbert, and Montezuma's Revenge) [Kurin et al., 2017]
Dataset Splits | No | The paper mentions using a dataset and training, but does not explicitly state the splits used for training, validation, and testing.
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for the experiments are mentioned in the paper.
Software Dependencies | No | No specific software dependencies with version numbers are provided in the paper.
Experiment Setup | Yes | For a fair comparison, we used a batch size of 128 and capped the learning process to 3.125M backward passes... we set (c_min, c_max) = (0.1, 2)... c_greedy = 0.1. The Q-function is learned with Q^π(a) = (1 − α)·Q^π(a) + α·r, where α is a learning rate, possibly decaying over time.
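
The quantities quoted in the Experiment Setup row can be made concrete with a short sketch. The snippet below is a minimal NumPy illustration, not the authors' implementation: it shows the incremental update Q^π(a) = (1 − α)·Q^π(a) + α·r quoted above, together with a greedy reroute step in the spirit of Algorithm 1 (Max-Reroute), interpreting (c_min, c_max) as bounds that keep the improved policy π within c_min·β(a) ≤ π(a) ≤ c_max·β(a) of the behavior policy β. The function names, the uniform behavior policy, and the example values are illustrative assumptions.

```python
import numpy as np


def incremental_q_update(q, action, reward, alpha):
    """Exponential-average update of the Q estimate for the taken action:
    Q(a) <- (1 - alpha) * Q(a) + alpha * r."""
    q = q.copy()
    q[action] = (1.0 - alpha) * q[action] + alpha * reward
    return q


def max_reroute(beta, q, c_min=0.1, c_max=2.0):
    """Greedy reroute step: maximize sum_a pi(a) * Q(a) subject to
    c_min * beta(a) <= pi(a) <= c_max * beta(a) and sum_a pi(a) = 1."""
    lower = c_min * beta
    upper = c_max * beta
    pi = lower.copy()
    budget = 1.0 - pi.sum()          # probability mass still free to reroute
    for a in np.argsort(-q):         # highest-Q actions first
        add = min(upper[a] - pi[a], budget)
        pi[a] += add
        budget -= add
        if budget <= 1e-12:
            break
    return pi / pi.sum()             # guard against numerical drift


if __name__ == "__main__":
    beta = np.full(4, 0.25)          # behavior policy (uniform, for illustration)
    q = np.zeros(4)
    q = incremental_q_update(q, action=0, reward=1.0, alpha=0.1)
    pi = max_reroute(beta, q, c_min=0.1, c_max=2.0)
    print("Q:", q)
    print("rerouted policy:", pi)    # action 0 capped at c_max * beta(0) = 0.5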