Reinforcement Learning Algorithm Selection

Authors: Romain Laroche, Raphael Feraud

ICLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | ESBAS is first empirically evaluated on a dialogue task where it is shown to outperform each individual algorithm in most configurations. ESBAS is then adapted to a true online setting where algorithms update their policies after each transition, which we call SSBAS. SSBAS is evaluated on a fruit collection task where it is shown to adapt the stepsize parameter more efficiently than the classical hyperbolic decay, and on an Atari game, where it improves the performance by a wide margin.
Researcher Affiliation | Industry | 1: Microsoft Research, Montréal, Canada; 2: Orange Labs, Lannion, France
Pseudocode | Yes | Pseudo-code 1: Online RL AS setting [...] Pseudo-code 2: ESBAS with UCB1 (a hedged Python sketch of this epoch-wise selection rule appears after the table)
Open Source Code | No | This paper does not contain an explicit statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | We investigate here AS for deep RL on the Arcade Learning Environment (ALE, Bellemare et al. (2013)) and more precisely the game Q*bert (a minimal environment-loading sketch appears after the table)
Dataset Splits | No | This paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) for training, validation, or testing.
Hardware Specification | No | This paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies | No | The paper mentions software components like 'Fitted-Q Iteration', 'Q-learning', 'DQN', and 'RMSprop optimizer' but does not provide specific version numbers for these or other ancillary software dependencies.
Experiment Setup | Yes | The hyperparameters used for training them are also the same and equal to the ones presented in the table hereinafter: minibatch size 32, replay memory size 1×10⁶, agent history length 4, target network update frequency 5×10⁴, discount factor 0.99, action repeat 20, update frequency 20, learning rate 2.5×10⁻⁴, exploration parameter ε following a decay schedule in t (its exact expression is given in the paper's table), replay start size 5×10⁴, no-op max 30. (These values are collected into a configuration sketch after the table.)
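
The Pseudocode row points to the paper's "ESBAS with UCB1" procedure. Below is a minimal, hedged Python sketch of that kind of epoch-wise selection: the candidate RL algorithms retrain on all trajectories collected so far at the start of each epoch, their policies stay frozen for the epoch, and a UCB1 bandit chooses which frozen policy controls each episode based on within-epoch returns. The `update`/`run_episode` interfaces and the doubling epoch length are assumptions for illustration, not the paper's exact Pseudo-code 2.

```python
import math


def ucb1_select(counts, sums, exploration=math.sqrt(2)):
    """Return the index of the arm with the highest UCB1 score (untried arms first)."""
    for k, n in enumerate(counts):
        if n == 0:
            return k
    total = sum(counts)
    return max(
        range(len(counts)),
        key=lambda k: sums[k] / counts[k]
        + exploration * math.sqrt(math.log(total) / counts[k]),
    )


def esbas(algorithms, run_episode, num_epochs):
    """Epoch-wise algorithm selection in the spirit of the paper's ESBAS/UCB1 pseudo-code.

    Assumed (hypothetical) interfaces:
      alg.update(trajectories) -> frozen policy trained on all data so far
      run_episode(policy)      -> (trajectory, episode_return)
    """
    trajectories = []
    for beta in range(num_epochs):
        # Retrain every candidate algorithm on the shared data, then freeze its policy.
        policies = [alg.update(trajectories) for alg in algorithms]
        # Fresh bandit statistics at the start of each epoch.
        counts = [0] * len(algorithms)
        sums = [0.0] * len(algorithms)
        # Epoch length grows geometrically (doubling is an assumption here).
        for _ in range(2 ** beta):
            k = ucb1_select(counts, sums)             # meta-level UCB1 choice
            trajectory, ret = run_episode(policies[k])
            trajectories.append(trajectory)
            counts[k] += 1
            sums[k] += ret
    return trajectories
```

Per the abstract quoted in the Research Type row, the SSBAS variant differs in that the candidate algorithms keep updating their policies after every transition rather than only at epoch boundaries, while the selection layer works the same way.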
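
The Open Datasets row identifies the benchmark as the Arcade Learning Environment game Q*bert. The following is a minimal loading sketch assuming the current Gymnasium Atari bindings (gymnasium plus ale-py), which are not something distributed with the paper; the environment id "ALE/Qbert-v5" and the random-action loop are illustrative only.

```python
import gymnasium as gym
import ale_py

gym.register_envs(ale_py)  # explicit registration, needed on recent gymnasium versions

env = gym.make("ALE/Qbert-v5")
obs, info = env.reset(seed=0)
done = False
total_return = 0.0
while not done:
    action = env.action_space.sample()  # random policy, just to exercise the loop
    obs, reward, terminated, truncated, info = env.step(action)
    total_return += reward
    done = terminated or truncated
env.close()
print("random-policy return on Q*bert:", total_return)
```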
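
The Experiment Setup row quotes the DQN hyperparameters used on Q*bert. They are collected here into a plain Python dictionary for readability; the key names are invented for this sketch, only the numeric values come from the quoted text, and the exploration-rate schedule is omitted because its exact expression is only given in the paper's own table.

```python
# Hyperparameters quoted in the "Experiment Setup" row (key names are illustrative).
dqn_hyperparameters = {
    "minibatch_size": 32,
    "replay_memory_size": int(1e6),
    "agent_history_length": 4,
    "target_network_update_frequency": int(5e4),
    "discount_factor": 0.99,
    "action_repeat": 20,
    "update_frequency": 20,
    "learning_rate": 2.5e-4,
    # exploration parameter epsilon: decay schedule in t, see the paper's table
    "replay_start_size": int(5e4),
    "no_op_max": 30,
}
```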