Reinforcement Learning Algorithm Selection

Authors: Romain Laroche, Raphael Feraud

ICLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | ESBAS is first empirically evaluated on a dialogue task where it is shown to outperform each individual algorithm in most configurations. ESBAS is then adapted to a true online setting where algorithms update their policies after each transition, which we call SSBAS. SSBAS is evaluated on a fruit collection task where it is shown to adapt the stepsize parameter more efficiently than the classical hyperbolic decay, and on an Atari game, where it improves the performance by a wide margin.
Researcher Affiliation | Industry | 1: Microsoft Research, Montréal, Canada; 2: Orange Labs, Lannion, France
Pseudocode | Yes | Pseudo-code 1: Online RL AS setting [...] Pseudo-code 2: ESBAS with UCB1 (a hedged Python sketch of this epoch-wise selection rule appears after the table)
Open Source Code | No | This paper does not contain an explicit statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | We investigate here AS for deep RL on the Arcade Learning Environment (ALE, Bellemare et al. (2013)) and more precisely the game Q*bert (a minimal environment-loading sketch appears after the table)
Dataset Splits | No | This paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) for training, validation, or testing.
Hardware Specification | No | This paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies | No | The paper mentions software components like 'Fitted-Q Iteration', 'Q-learning', 'DQN', and 'RMSprop optimizer' but does not provide specific version numbers for these or other ancillary software dependencies.
Experiment Setup | Yes | The hyperparameters used for training them are also the same and equal to the ones presented in the table hereinafter: minibatch size 32, replay memory size 1×10⁶, agent history length 4, target network update frequency 5×10⁴, discount factor 0.99, action repeat 20, update frequency 20, learning rate 2.5×10⁻⁴, exploration parameter ε following a decay schedule in t (its exact expression is given in the paper's table), replay start size 5×10⁴, no-op max 30. (These values are collected into a configuration sketch after the table.)
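
The Pseudocode row points to the paper's "ESBAS with UCB1" procedure. Below is a minimal, hedged Python sketch of that kind of epoch-wise selection: the candidate RL algorithms retrain on all trajectories collected so far at the start of each epoch, their policies stay frozen for the epoch, and a UCB1 bandit chooses which frozen policy controls each episode based on within-epoch returns. The `update`/`run_episode` interfaces and the doubling epoch length are assumptions for illustration, not the paper's exact Pseudo-code 2.

```python
import math


def ucb1_select(counts, sums, exploration=math.sqrt(2)):
    """Return the index of the arm with the highest UCB1 score (untried arms first)."""
    for k, n in enumerate(counts):
        if n == 0:
            return k
    total = sum(counts)
    return max(
        range(len(counts)),
        key=lambda k: sums[k] / counts[k]
        + exploration * math.sqrt(math.log(total) / counts[k]),
    )


def esbas(algorithms, run_episode, num_epochs):
    """Epoch-wise algorithm selection in the spirit of the paper's ESBAS/UCB1 pseudo-code.

    Assumed (hypothetical) interfaces:
      alg.update(trajectories) -> frozen policy trained on all data so far
      run_episode(policy)      -> (trajectory, episode_return)
    """
    trajectories = []
    for beta in range(num_epochs):
        # Retrain every candidate algorithm on the shared data, then freeze its policy.
        policies = [alg.update(trajectories) for alg in algorithms]
        # Fresh bandit statistics at the start of each epoch.
        counts = [0] * len(algorithms)
        sums = [0.0] * len(algorithms)
        # Epoch length grows geometrically (doubling is an assumption here).
        for _ in range(2 ** beta):
            k = ucb1_select(counts, sums)             # meta-level UCB1 choice
            trajectory, ret = run_episode(policies[k])
            trajectories.append(trajectory)
            counts[k] += 1
            sums[k] += ret
    return trajectories
```

Per the abstract quoted in the Research Type row, the SSBAS variant differs in that the candidate algorithms keep updating their policies after every transition rather than only at epoch boundaries, while the selection layer works the same way.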
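
The Open Datasets row identifies the benchmark as the Arcade Learning Environment game Q*bert. The following is a minimal loading sketch assuming the current Gymnasium Atari bindings (gymnasium plus ale-py), which are not something distributed with the paper; the environment id "ALE/Qbert-v5" and the random-action loop are illustrative only.

```python
import gymnasium as gym
import ale_py

gym.register_envs(ale_py)  # explicit registration, needed on recent gymnasium versions

env = gym.make("ALE/Qbert-v5")
obs, info = env.reset(seed=0)
done = False
total_return = 0.0
while not done:
    action = env.action_space.sample()  # random policy, just to exercise the loop
    obs, reward, terminated, truncated, info = env.step(action)
    total_return += reward
    done = terminated or truncated
env.close()
print("random-policy return on Q*bert:", total_return)
```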
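
The Experiment Setup row quotes the DQN hyperparameters used on Q*bert. They are collected here into a plain Python dictionary for readability; the key names are invented for this sketch, only the numeric values come from the quoted text, and the exploration-rate schedule is omitted because its exact expression is only given in the paper's own table.

```python
# Hyperparameters quoted in the "Experiment Setup" row (key names are illustrative).
dqn_hyperparameters = {
    "minibatch_size": 32,
    "replay_memory_size": int(1e6),
    "agent_history_length": 4,
    "target_network_update_frequency": int(5e4),
    "discount_factor": 0.99,
    "action_repeat": 20,
    "update_frequency": 20,
    "learning_rate": 2.5e-4,
    # exploration parameter epsilon: decay schedule in t, see the paper's table
    "replay_start_size": int(5e4),
    "no_op_max": 30,
}
```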