Double Gumbel Q-Learning
Authors: David Yu-Tung Hui, Aaron C. Courville, Pierre-Luc Bacon
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present a default value for our pessimism hyperparameter that enables DoubleGum to outperform DDPG, TD3, SAC, XQL, quantile regression, and Mixture-of-Gaussian Critics in aggregate over 33 tasks from DeepMind Control, MuJoCo, Meta-World, and Box2D and show that tuning this hyperparameter may further improve sample efficiency. (A hypothetical sketch of such a pessimism-shifted target appears below the table.) |
| Researcher Affiliation | Academia | David Yu-Tung Hui (Mila, Université de Montréal, dythui2+drl@gmail.com); Aaron Courville (Mila, Université de Montréal); Pierre-Luc Bacon (Mila, Université de Montréal) |
| Pseudocode | Yes | Algorithm 1 presents pseudocode for DoubleGum, an off-policy Deep Q-Learning algorithm that learns θ in both discrete and continuous control. |
| Open Source Code | Yes | 1Code: https://github.com/dyth/doublegum |
| Open Datasets | Yes | Figure 1 shows the evolution of noise during training DoubleGum in the classic control task of CartPole-v1. ... We benchmarked DoubleGum on 33 tasks over 4 continuous control suites comprising respectively of 11 DeepMind Control (DMC) tasks (Tassa et al., 2018; Tunyasuvunakool et al., 2020), 5 MuJoCo tasks (Todorov et al., 2012; Brockman et al., 2016), 15 Meta-World tasks (Yu et al., 2020) and 2 Box2D tasks (Brockman et al., 2016). |
| Dataset Splits | No | The paper does not provide explicit details about training/validation/test dataset splits. It discusses evaluation tasks and training steps, but not a formal data-splitting strategy that would be needed to reproduce such splits directly. |
| Hardware Specification | Yes | A single training run for discrete control may take up to 3 to 5 minutes on a laptop with an Intel Core i9 CPU, NVIDIA 1050 GPU and 31.0 GiB of RAM. On the same system, a single training run for continuous control takes 1–2 hours. ... This cluster had a mixture of Intel Broadwell, Skylake, Cascade Lake, AMD Rome, and AMD Milan CPUs, and NVIDIA P100, V100, and A100 GPUs. |
| Software Dependencies | Yes | All algorithms and code were implemented in JAX (Bradbury et al., 2018), Haiku (Hennigan et al., 2020), and cpprb (Yamada, 2019). |
| Experiment Setup | Yes | All networks had two hidden layers of size 256, ReLU activations (Glorot et al., 2011), orthogonal initialization (Saxe et al., 2013) with a gain of √2 for all layers apart from the last layer of the policy and variance head, which had gains of 1. ... All parameters were optimized by Adam (Kingma and Ba, 2014) with default hyperparameters. ... Table 3: Hyperparameters for Discrete Control ... Table 4: Hyperparameters for Continuous Control. (A minimal Haiku sketch of this network setup appears below the table.) |
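
The pessimism hyperparameter quoted in the Research Type row controls how conservatively the bootstrapped target is formed. The snippet below is **not** the paper's Algorithm 1; it is a generic, hedged sketch assuming a coefficient `c` that shifts the target by a multiple of a learned noise scale, with a hypothetical function name and a placeholder default value.

```python
# Hypothetical illustration only: DoubleGum's actual update is Algorithm 1 in the
# paper, not this function. The sketch shows one common way a pessimism
# coefficient c can enter a TD target when the critic also predicts a noise scale.
import jax.numpy as jnp


def pessimistic_td_target(reward, discount, next_q_mean, next_q_scale, c=0.0):
    """Bootstrapped target shifted by c times the predicted noise scale.

    c < 0 yields a more pessimistic target and c > 0 a more optimistic one; the
    paper reports a single default value and shows tuning it per-suite can
    further improve sample efficiency.
    """
    return reward + discount * (next_q_mean + c * next_q_scale)


# Example usage on a batch of two transitions with a slightly pessimistic c.
targets = pessimistic_td_target(
    reward=jnp.array([1.0, 0.5]),
    discount=jnp.array([0.99, 0.99]),
    next_q_mean=jnp.array([10.0, 8.0]),
    next_q_scale=jnp.array([2.0, 1.5]),
    c=-0.1,  # placeholder, not the paper's reported default
)
```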
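
The Experiment Setup row fully specifies the network shape and initialization, so a minimal sketch in the JAX/Haiku stack listed under Software Dependencies is given below. It is not the authors' released code (see the repository linked above): the critic signature, the example observation/action sizes, and the 3e-4 learning rate are illustrative assumptions, while the layer sizes, activations, and orthogonal gains follow the quoted setup.

```python
# Minimal sketch, assuming the JAX/Haiku/optax stack; not the authors' released code.
import haiku as hk
import jax
import jax.numpy as jnp
import optax


def critic_fn(obs: jnp.ndarray, action: jnp.ndarray):
    """Q-value head plus a noise-scale ("variance") head, initialized as quoted."""
    hidden_init = hk.initializers.Orthogonal(scale=2 ** 0.5)  # gain sqrt(2) for most layers
    head_init = hk.initializers.Orthogonal(scale=1.0)         # gain 1 for the variance head
    x = jnp.concatenate([obs, action], axis=-1)
    for _ in range(2):                                        # two hidden layers of size 256
        x = jax.nn.relu(hk.Linear(256, w_init=hidden_init)(x))
    q_value = hk.Linear(1, w_init=hidden_init)(x)             # Q-value output
    log_scale = hk.Linear(1, w_init=head_init)(x)             # variance head, gain 1
    return q_value, log_scale


critic = hk.without_apply_rng(hk.transform(critic_fn))
# Example observation/action sizes; the real shapes depend on the task.
params = critic.init(jax.random.PRNGKey(0), jnp.zeros((1, 17)), jnp.zeros((1, 6)))
optimizer = optax.adam(3e-4)  # Adam with default betas/eps; the learning rate is an assumption
opt_state = optimizer.init(params)
```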