Deep Laplacian-based Options for Temporally-Extended Exploration
Authors: Martin Klissarov, Marlos C. Machado
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce a fully online deep RL algorithm for discovering Laplacian-based options and evaluate our approach on a variety of pixel-based tasks. We compare to several state-of-the-art exploration methods and show that our approach is effective, general, and especially promising in non-stationary settings. |
| Researcher Affiliation | Collaboration | 1 Mila, McGill University; 2 Alberta Machine Intelligence Institute (Amii); 3 Department of Computing Science, University of Alberta; 4 Canada CIFAR AI Chair. *Work done during an internship at DeepMind. Work mostly done while at DeepMind. |
| Pseudocode | Yes | Algorithm 1 Fully Online DCEO Algorithm |
| Open Source Code | No | The paper thanks external individuals for providing code for *components* (generalized Laplacian, Rubik's Cube 2x2 implementation) used in their work, but does not state that *their own methodology's code* is open source or publicly available. There is no explicit statement like 'Our code is available at...'. |
| Open Datasets | Yes | We evaluate CEO in environments with different topologies (cf. Figure 1 and detailed description in Appendix B) while comparing its performance against well-established exploration algorithms. We validate DCEO's efficacy on pixel-based versions of the environments in Figure 1. We perform experiments on the Atari 2600 game Montezuma's Revenge through the Arcade Learning Environment (Bellemare et al., 2013; Machado et al., 2018a) and in the MiniWorld domain (Chevalier-Boisvert, 2018). For the Rubik's Cube: We use the open-source implementation made available by de Asis (2018). |
| Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits (e.g., specific percentages or sample counts) needed to reproduce the experiment. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments with specific details like GPU models, CPU types, or cloud instance specifications. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers (e.g., library or solver names with version numbers) needed to replicate the experiment. |
| Experiment Setup | Yes | For the reward maximization phase, the DCEO agent learns to maximize reward with DDQN and n-step targets (cf. Eq. 1), and it uses ϵ-greedy as the exploration strategy. The discovered options have an impact when the agent takes an exploratory step: with probability µ the agent does not simply take a random primitive action; instead, it acts according to a sampled option's policy until it terminates (denoted by τ), thus exploring in a temporally-extended way (see the illustrative sketch below this table). We use D = 10 in all experiments. For the experiments on reward maximization, all methods are implemented on top of an n-step Double DQN (DDQN) baseline with n = 5. Details on parameter tuning for each method, and the parameters used, are available in Appendix I. Appendix I explicitly states: For all deep learning experiments we used a step size of 0.0001. The convolutional networks were two-layer convolutional networks with 32 channels, 3-by-3 kernels, and stride 2. This was followed by a fully connected layer of size 256, followed by the outputs of the networks. All activations were ReLUs. For the Rubik's Cube experiments, we used a stack of 3 fully connected layers of size 256 before the outputs. All activations were ReLUs. |
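
To make the exploration scheme in the Experiment Setup row concrete, the sketch below illustrates the temporally-extended ϵ-greedy action selection described there: if an option is active, the agent follows its policy until termination; otherwise, on an exploratory step, with probability µ it samples one of the D discovered options instead of a random primitive action. This is a minimal sketch, not the authors' released code; `build_torso` follows the Appendix I description (two conv layers, 32 channels, 3×3 kernels, stride 2, a 256-unit fully connected layer, ReLU activations), while names such as `option_policies`, `option_terminations`, and the default values of `epsilon` and `mu` are illustrative assumptions.

```python
# Illustrative sketch (not the authors' implementation) of DCEO-style
# temporally-extended epsilon-greedy exploration and the Appendix I torso.
import random
import torch
import torch.nn as nn


def build_torso(in_channels: int = 4) -> nn.Sequential:
    """Two conv layers (32 channels, 3x3 kernels, stride 2) + FC(256), ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, kernel_size=3, stride=2), nn.ReLU(),
        nn.Conv2d(32, 32, kernel_size=3, stride=2), nn.ReLU(),
        nn.Flatten(),
        nn.LazyLinear(256), nn.ReLU(),  # output heads would follow this layer
    )


def select_action(obs, q_values, option_policies, option_terminations,
                  num_actions, epsilon=0.1, mu=0.5, current_option=None):
    """Temporally-extended epsilon-greedy step (placeholder hyperparameters).

    If an option is active, follow its policy until it terminates.
    Otherwise act greedily with prob. (1 - epsilon); on an exploratory step,
    sample an option with prob. mu or take a random primitive action.
    Returns (action, current_option).
    """
    if current_option is not None:
        action = option_policies[current_option](obs)
        if random.random() < option_terminations[current_option](obs):
            current_option = None  # option terminates stochastically
        return action, current_option

    if random.random() > epsilon:
        return int(torch.argmax(q_values).item()), None  # greedy primitive action

    if random.random() < mu:  # temporally-extended exploration via an option
        current_option = random.randrange(len(option_policies))  # D = 10 in the paper
        return option_policies[current_option](obs), current_option

    return random.randrange(num_actions), None  # plain random primitive action
```

In this sketch, `option_policies[i](obs)` and `option_terminations[i](obs)` stand in for the learned option policy and termination functions of the i-th Laplacian-based option; how those are trained online is specified by Algorithm 1 in the paper.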