Reinforcement Learning Algorithm Selection
Authors: Romain Laroche, Raphaël Féraud
ICLR 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | ESBAS is first empirically evaluated on a dialogue task where it is shown to outperform each individual algorithm in most configurations. ESBAS is then adapted to a true online setting where algorithms update their policies after each transition, which we call SSBAS. SSBAS is evaluated on a fruit collection task where it is shown to adapt the stepsize parameter more efficiently than the classical hyperbolic decay, and on an Atari game, where it improves the performance by a wide margin. |
| Researcher Affiliation | Industry | ¹ Microsoft Research, Montréal, Canada; ² Orange Labs, Lannion, France |
| Pseudocode | Yes | Pseudo-code 1: Online RL AS setting [...] Pseudo-code 2: ESBAS with UCB1 |
| Open Source Code | No | This paper does not contain an explicit statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | We investigate here AS for deep RL on the Arcade Learning Environment (ALE, Bellemare et al. (2013)) and more precisely the game Q*bert |
| Dataset Splits | No | This paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) for training, validation, or testing. |
| Hardware Specification | No | This paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions software components like 'Fitted-Q Iteration', 'Q-learning', 'DQN', and 'RMSprop optimizer' but does not provide specific version numbers for these or other ancillary software dependencies. |
| Experiment Setup | Yes | The hyperparameters used for training them are also the same and equal to the ones presented in the table hereinafter: minibatch size 32, replay memory size 1×10⁶, agent history length 4, target network update frequency 5×10⁴, discount factor 0.99, action repeat 20, update frequency 20, learning rate 2.5×10⁻⁴, exploration parameter ϵ 5 t 1 10-6, replay start size 5×10⁴, no-op max 30. |
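
The Pseudocode row above cites "Pseudo-code 2: ESBAS with UCB1", but, as the Open Source Code row notes, no implementation is released with the paper. Below is a minimal Python sketch of that loop written from the paper's description (epochs of doubling length, candidate policies frozen within an epoch, UCB1 over observed episodic returns); the `Algorithm.policy(dataset)` interface and the `run_episode` environment hook are illustrative assumptions, not the authors' code.

```python
import math


def ucb1_select(counts, mean_returns, exploration=1.0):
    """Pick the arm (candidate algorithm) with the highest UCB1 score."""
    # Play every arm once before applying the UCB1 formula.
    for k, n in enumerate(counts):
        if n == 0:
            return k
    total = sum(counts)
    return max(
        range(len(counts)),
        key=lambda k: mean_returns[k]
        + exploration * math.sqrt(2.0 * math.log(total) / counts[k]),
    )


def esbas(algorithms, run_episode, num_epochs):
    """Sketch of ESBAS: within epoch beta (2**beta episodes) the candidate
    policies are frozen and a UCB1 bandit chooses which one controls each
    episode; all algorithms then re-train on the shared trajectory set."""
    trajectories = []
    for beta in range(num_epochs):
        # Assumed interface: each algorithm returns its current policy
        # trained on the trajectories collected so far.
        policies = [alg.policy(trajectories) for alg in algorithms]
        counts = [0] * len(algorithms)
        means = [0.0] * len(algorithms)
        for _ in range(2 ** beta):
            k = ucb1_select(counts, means)
            trajectory, episodic_return = run_episode(policies[k])
            trajectories.append(trajectory)
            counts[k] += 1
            means[k] += (episodic_return - means[k]) / counts[k]  # running mean return
    return trajectories
```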
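
The Open Datasets row points to the Arcade Learning Environment and the game Q*bert. As a hedged sketch of how that environment can be accessed today, the snippet below uses the `ale-py` Python bindings with a random policy standing in for the paper's DQN agents; the ROM path is a placeholder, and the paper does not state which ALE interface the authors used.

```python
from random import choice

from ale_py import ALEInterface

ale = ALEInterface()
ale.setInt("random_seed", 0)
ale.loadROM("roms/qbert.bin")  # placeholder path; ROMs are not bundled with the paper

actions = ale.getMinimalActionSet()
total_reward = 0
for _ in range(10_000):  # cap the rollout length
    if ale.game_over():
        break
    total_reward += ale.act(choice(actions))  # random policy stand-in for a DQN agent
print("random-policy episode return:", total_reward)
```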
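
For convenience, the hyperparameters quoted in the Experiment Setup row are restated below as a Python configuration dictionary; the key names follow common DQN conventions rather than the paper's own table, and the ε exploration schedule is omitted because its extracted form ("ϵ 5 t 1 10-6") could not be reliably reconstructed.

```python
# Hypothetical config collecting the DQN hyperparameters quoted above.
DQN_HYPERPARAMETERS = {
    "minibatch_size": 32,
    "replay_memory_size": int(1e6),
    "agent_history_length": 4,
    "target_network_update_frequency": int(5e4),
    "discount_factor": 0.99,
    "action_repeat": 20,
    "update_frequency": 20,
    "learning_rate": 2.5e-4,
    # exploration parameter ε: the paper reports a schedule in t, but the
    # extracted text is too garbled to reconstruct here.
    "replay_start_size": int(5e4),
    "no_op_max": 30,
}
```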