Deep Laplacian-based Options for Temporally-Extended Exploration

Authors: Martin Klissarov, Marlos C. Machado

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We introduce a fully online deep RL algorithm for discovering Laplacian-based options and evaluate our approach on a variety of pixel-based tasks. We compare to several state-of-the-art exploration methods and show that our approach is effective, general, and especially promising in non-stationary settings.
Researcher Affiliation | Collaboration | ¹Mila, McGill University; ²Alberta Machine Intelligence Institute (Amii); ³Department of Computing Science, University of Alberta; ⁴Canada CIFAR AI Chair. *Work done during an internship at DeepMind. Work mostly done while at DeepMind.
Pseudocode | Yes | Algorithm 1: Fully Online DCEO Algorithm
Open Source Code | No | The paper thanks external individuals for providing code for *components* (generalized Laplacian, Rubik's Cube 2x2 implementation) used in the work, but it does not state that the authors' own method's code is open source or publicly available. There is no explicit statement like "Our code is available at...".
Open Datasets | Yes | We evaluate CEO in environments with different topologies (cf. Figure 1 and detailed description in Appendix B) while comparing its performance against well-established exploration algorithms. We validate DCEO's efficacy on pixel-based versions of the environments in Figure 1. We perform experiments on the Atari 2600 game Montezuma's Revenge through the Arcade Learning Environment (Bellemare et al., 2013; Machado et al., 2018a) and in the MiniWorld domain (Chevalier-Boisvert, 2018). For the Rubik's Cube: We use the open-source implementation made available by de Asis (2018).
Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits (e.g., specific percentages or sample counts) needed to reproduce the experiments.
Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments with specific details such as GPU models, CPU types, or cloud instance specifications.
Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers (e.g., library or solver names with version numbers) needed to replicate the experiments.
Experiment Setup | Yes | For the reward maximization phase, the DCEO agent learns to maximize reward with DDQN and n-step targets (cf. Eq. 1), and it uses ϵ-greedy as the exploration strategy. The discovered options have an impact when the agent takes an exploratory step: with probability µ the agent does not simply take a random primitive action; instead, it acts according to a sampled option's policy until the option terminates (denoted by τ), thus exploring in a temporally-extended way. We use D = 10 in all experiments. For the experiments on reward maximization, all methods are implemented on top of an n-step Double DQN (DDQN) baseline with n = 5. Details on parameter tuning for each method, and the parameters used, are available in Appendix I. Appendix I explicitly states: "For all deep learning experiments we used a step size of 0.0001. The convolutional network was a two-layer convolutional network with 32 channels, 3-by-3 kernels, and stride 2. This was followed by a fully connected layer of size 256, followed by the outputs of the networks. All activations were ReLUs. For the Rubik's experiments, we used a stack of 3 fully connected layers of size 256 before the outputs."
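
To make the setup row concrete, below is a minimal sketch (not the authors' code, which the table notes is not publicly released) of the reported network sizes and the temporally-extended ϵ-greedy rule, assuming PyTorch. Class and function names, the option-termination interface, and the use of LazyLinear to infer the flattened convolutional size are hypothetical stand-ins; the reported step size of 0.0001 would apply to whichever optimizer the authors used, which this excerpt does not specify.

```python
# Sketch only: layer sizes follow the reported setup (two conv layers, 32 channels,
# 3x3 kernels, stride 2, a 256-unit dense layer, ReLU activations).
import random
import torch
import torch.nn as nn

class PixelNetwork(nn.Module):
    """Convolutional torso + head matching the sizes reported in Appendix I."""
    def __init__(self, in_channels: int, num_outputs: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(256), nn.ReLU(),    # 256-unit fully connected layer
            nn.Linear(256, num_outputs),      # Q-values or option outputs
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

def select_action(obs, q_net, options, num_actions, epsilon, mu,
                  current_option=None):
    """Temporally-extended epsilon-greedy step as described above: if an option
    is active, follow it until it terminates; on an exploratory step, sample an
    option with probability mu, otherwise take a random primitive action.
    `options` is a list of (policy, terminates) pairs -- hypothetical names."""
    if current_option is not None:
        policy, terminates = current_option
        if terminates(obs):                      # option termination (tau)
            current_option = None
        else:
            return policy(obs), current_option   # keep following the option
    if random.random() < epsilon:                # exploratory step
        if random.random() < mu:                 # temporally-extended exploration
            current_option = random.choice(options)
            policy, _ = current_option
            return policy(obs), current_option
        return random.randrange(num_actions), None   # random primitive action
    with torch.no_grad():                        # greedy w.r.t. the Q-network
        return int(q_net(obs).argmax(dim=-1)), None
```

Eq. 1 itself is not reproduced in this summary; the setup only says that all methods use n-step targets with Double DQN and n = 5. Under the same assumptions, a standard n-step Double DQN target (target network evaluated at the online network's argmax action) would look like the following; the discount value is an assumed placeholder.

```python
import torch

def n_step_ddqn_target(rewards, next_obs, q_online, q_target, gamma=0.99, n=5):
    """Standard n-step Double DQN target; `rewards` holds r_t ... r_{t+n-1}."""
    assert len(rewards) == n
    g = sum(gamma ** k * r for k, r in enumerate(rewards))   # discounted n-step return
    with torch.no_grad():
        best = q_online(next_obs).argmax(dim=-1, keepdim=True)        # action selection
        bootstrap = q_target(next_obs).gather(-1, best).squeeze(-1)   # action evaluation
    return g + gamma ** n * bootstrap
```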