Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Constrained Policy Improvement for Efficient Reinforcement Learning
Authors: Elad Sarafian, Aviv Tamar, Sarit Kraus
IJCAI 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate RBI in two tasks in the Atari Learning Environment: (1) learning from observations of multiple behavior policies and (2) iterative RL. Our results demonstrate the advantage of RBI over greedy policies and other constrained policy optimization algorithms both in learning from observations and in RL tasks. |
| Researcher Affiliation | Academia | 1 Bar-Ilan University, Israel 2 Technion, Israel |
| Pseudocode | Yes | Algorithm 1 Max-Reroute, Algorithm 2 RBI learner, Algorithm 3 RBI actor |
| Open Source Code | Yes | The appendix and the source code for the Atari experiments are found at github.com/eladsar/rbi/tree/rbi. |
| Open Datasets | Yes | To that end, we use a crowdsourced dataset of 4 Atari games (Space Invaders, Ms Pacman, Qbert, and Montezuma's Revenge) [Kurin et al., 2017] |
| Dataset Splits | No | The paper mentions using a dataset and training, but does not explicitly state the dataset splits for training, validation, and testing. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for experiments are mentioned in the paper. |
| Software Dependencies | No | No specific software dependencies with version numbers are provided in the paper. |
| Experiment Setup | Yes | For a fair comparison, we used a batch size of 128 and capped the learning process to 3.125M backward passes... we set (cmin, cmax) = (0.1, 2)... cgreedy = 0.1. The Q-function is learned with Qπ(a) = (1 − α)Qπ(a) + αr, where α is a learning rate, possibly decaying over time. |
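The incremental Q-update quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration of the formula Qπ(a) = (1 − α)Qπ(a) + αr only; the function name and the decaying learning-rate schedule are illustrative assumptions, not taken from the paper.

```python
# Sketch of the tabular Q-update quoted above:
#   Q(a) <- (1 - alpha) * Q(a) + alpha * r
# with alpha possibly decaying over time. Names here are hypothetical.

def q_update(q, a, r, alpha):
    """One incremental update of the action-value estimate for action a."""
    q[a] = (1 - alpha) * q[a] + alpha * r
    return q

q = {0: 0.0, 1: 0.0}
# Example stream of (action, reward) pairs with a 1/t decaying learning rate
# (an assumed schedule; the paper only says "possibly decaying over time").
for t, (a, r) in enumerate([(0, 1.0), (0, 1.0), (1, 0.5)], start=1):
    alpha = 1.0 / t
    q = q_update(q, a, r, alpha)
```

With a 1/t schedule this computes a running average of the observed rewards per action, which is the standard reason such updates are paired with decaying rates.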