Optimistic Exploration even with a Pessimistic Initialisation

Authors: Tabish Rashid, Bei Peng, Wendelin Boehmer, Shimon Whiteson

ICLR 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We compare OPIQ against baselines and ablations on three sparse reward environments. The first is a randomized version of the Chain environment proposed by Osband et al. (2016) and used in (Shyam et al., 2019) with a chain of length 100, which we call Randomised Chain. The second is a two-dimensional maze in which the agent starts in the top left corner (white dot) and is only rewarded upon finding the goal (light grey dot). The third is Montezuma's Revenge from the Arcade Learning Environment (Bellemare et al., 2013), a notoriously difficult sparse reward environment commonly used as a benchmark to evaluate the performance and scaling of Deep RL exploration algorithms.
Researcher Affiliation | Academia | Tabish Rashid, Bei Peng, Wendelin Böhmer, Shimon Whiteson; University of Oxford, Department of Computer Science; {tabish.rashid, bei.peng, wendelin.boehmer, shimon.whiteson}@cs.ox.ac.uk
Pseudocode | Yes | Algorithm 1: OPIQ algorithm (a hedged sketch of the optimistic augmentation it relies on appears after the table)
Open Source Code | Yes | Code is available at: https://github.com/oxwhirl/opiq.
Open Datasets | Yes | The third is Montezuma's Revenge from the Arcade Learning Environment (Bellemare et al., 2013), a notoriously difficult sparse reward environment commonly used as a benchmark to evaluate the performance and scaling of Deep RL exploration algorithms.
Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning. It mentions training durations and batch sizes: “Training lasts for 100k timesteps. ϵ is fixed at 0.01 for all methods except for ϵ-greedy DQN in which it is linearly decayed from 1 to 0.01 over {100, 50k, 100k} timesteps. We train on a batch size of 64 after every timestep with a replay buffer of size 10k.”
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. It mentions a general equipment grant: “The experiments were made possible by a generous equipment grant from NVIDIA and the JP Morgan Chase Faculty Research Award.”
Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers. It mentions the use of RMSProp: “In all experiments we set γ = 0.99, use RMSProp with a learning rate of 0.0005 and scale the gradient norms during training to be at most 5.”
Experiment Setup | Yes | In all experiments we set γ = 0.99, use RMSProp with a learning rate of 0.0005 and scale the gradient norms during training to be at most 5. Training lasts for 100k timesteps. ϵ is fixed at 0.01 for all methods except for ϵ-greedy DQN in which it is linearly decayed from 1 to 0.01 over {100, 50k, 100k} timesteps. We train on a batch size of 64 after every timestep with a replay buffer of size 10k. The target network is updated every 200 timesteps. The embedding size used for the counts is 32. We set β = 0.1 for the scale of the count-based intrinsic motivation.
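For reference, the values quoted in the Experiment Setup row can be collected into a single configuration. The sketch below does that and shows how the RMSProp optimiser and the gradient-norm scaling would typically be wired up. The use of PyTorch (with its default RMSProp parameters beyond the learning rate) and the helper names are assumptions for illustration; the framework is not stated in the quotes above.

```python
import torch

# Hyperparameters quoted in the "Experiment Setup" row above.
CONFIG = {
    "gamma": 0.99,                  # discount factor
    "lr": 5e-4,                     # RMSProp learning rate
    "max_grad_norm": 5.0,           # gradient norms scaled to be at most 5
    "total_timesteps": 100_000,     # training duration
    "epsilon": 0.01,                # fixed exploration rate (except the epsilon-greedy DQN baseline)
    "batch_size": 64,               # one batch trained after every timestep
    "replay_buffer_size": 10_000,
    "target_update_interval": 200,  # timesteps between target-network updates
    "count_embedding_size": 32,     # embedding size used for the counts
    "beta": 0.1,                    # scale of the count-based intrinsic motivation
}


def make_optimizer(q_network: torch.nn.Module) -> torch.optim.Optimizer:
    """RMSProp with the quoted learning rate (PyTorch defaults otherwise; assumed)."""
    return torch.optim.RMSprop(q_network.parameters(), lr=CONFIG["lr"])


def train_step(loss: torch.Tensor, q_network: torch.nn.Module,
               optimizer: torch.optim.Optimizer) -> None:
    """One gradient step with the quoted gradient-norm scaling."""
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(q_network.parameters(), CONFIG["max_grad_norm"])
    optimizer.step()
```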
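Algorithm 1 (the OPIQ algorithm) is not reproduced in the quotes above. As a rough orientation only, here is a minimal sketch of the idea it relies on: Q-values are augmented with a count-based optimism bonus, assumed here to take the form Q+(s, a) = Q(s, a) + C / (N(s, a) + 1)^M, used both for action selection and for bootstrapping with separate scales. The exact functional form, the helper names, and all placeholder constants below are assumptions for illustration, not taken from the quoted text.

```python
import numpy as np


def optimistic_q(q_values, counts, c, m):
    """Count-based optimistic augmentation of the Q-values
    (assumed form: Q+(s,a) = Q(s,a) + C / (N(s,a) + 1)^M)."""
    return q_values + c / (counts + 1.0) ** m


def select_action(q_values, counts, c_action, m, epsilon=0.01):
    """Epsilon-greedy over the optimistically augmented Q-values."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))
    return int(np.argmax(optimistic_q(q_values, counts, c_action, m)))


def td_target(reward, next_q_values, next_counts, c_bootstrap, m,
              gamma=0.99, intrinsic_bonus=0.0):
    """One-step target that also bootstraps from the augmented Q-values.
    intrinsic_bonus stands in for the count-based intrinsic reward whose
    scale beta = 0.1 is quoted above; its exact form is not quoted here."""
    return reward + intrinsic_bonus + gamma * np.max(
        optimistic_q(next_q_values, next_counts, c_bootstrap, m))


if __name__ == "__main__":
    # Toy usage with placeholder constants (not taken from the paper).
    q = np.array([0.0, 0.1, -0.2])
    n = np.array([3.0, 0.0, 12.0])
    print(select_action(q, n, c_action=1.0, m=2.0))
```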