Double Gumbel Q-Learning
Authors: David Yu-Tung Hui, Aaron C. Courville, Pierre-Luc Bacon
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present a default value for our pessimism hyperparameter that enables DoubleGum to outperform DDPG, TD3, SAC, XQL, quantile regression, and Mixture-of-Gaussian Critics in aggregate over 33 tasks from DeepMind Control, MuJoCo, Meta-World, and Box2D and show that tuning this hyperparameter may further improve sample efficiency. (A hypothetical sketch of such a pessimism-shifted target appears below the table.) |
| Researcher Affiliation | Academia | David Yu-Tung Hui (Mila, Université de Montréal, dythui2+drl@gmail.com); Aaron Courville (Mila, Université de Montréal); Pierre-Luc Bacon (Mila, Université de Montréal) |
| Pseudocode | Yes | Algorithm 1 presents pseudocode for DoubleGum, an off-policy Deep Q-Learning algorithm that learns θ in both discrete and continuous control. |
| Open Source Code | Yes | 1Code: https://github.com/dyth/doublegum |
| Open Datasets | Yes | Figure 1 shows the evolution of noise during training DoubleGum in the classic control task of CartPole-v1. ... We benchmarked DoubleGum on 33 tasks over 4 continuous control suites comprising respectively of 11 DeepMind Control (DMC) tasks (Tassa et al., 2018; Tunyasuvunakool et al., 2020), 5 MuJoCo tasks (Todorov et al., 2012; Brockman et al., 2016), 15 Meta-World tasks (Yu et al., 2020) and 2 Box2D tasks (Brockman et al., 2016). |
| Dataset Splits | No | The paper does not provide explicit details about training/validation/test dataset splits. It discusses evaluation tasks and training steps, but not a formal data-splitting strategy that would be needed to reproduce such splits directly. |
| Hardware Specification | Yes | A single training run for discrete control may take up to 3 to 5 minutes on a laptop with an Intel Core i9 CPU, NVIDIA 1050 GPU and 31.0 GiB of RAM. On the same system, a single training run for continuous control takes 1–2 hours. ... This cluster had a mixture of Intel Broadwell, Skylake, Cascade Lake, AMD Rome, and AMD Milan CPUs, and NVIDIA P100, V100, and A100 GPUs. |
| Software Dependencies | Yes | All algorithms and code were implemented in JAX (Bradbury et al., 2018), Haiku (Hennigan et al., 2020), and cpprb (Yamada, 2019). |
| Experiment Setup | Yes | All networks had two hidden layers of size 256, ReLU activations (Glorot et al., 2011), orthogonal initialization (Saxe et al., 2013) with a gain of √2 for all layers apart from the last layer of the policy and variance head, which had gains of 1. ... All parameters were optimized by Adam (Kingma and Ba, 2014) with default hyperparameters. ... Table 3: Hyperparameters for Discrete Control ... Table 4: Hyperparameters for Continuous Control. (A minimal Haiku sketch of this network setup appears below the table.) |
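
The pessimism hyperparameter quoted in the Research Type row controls how conservatively the bootstrapped target is formed. The snippet below is **not** the paper's Algorithm 1; it is a generic, hedged sketch assuming a coefficient `c` that shifts the target by a multiple of a learned noise scale, with a hypothetical function name and a placeholder default value.

```python
# Hypothetical illustration only: DoubleGum's actual update is Algorithm 1 in the
# paper, not this function. The sketch shows one common way a pessimism
# coefficient c can enter a TD target when the critic also predicts a noise scale.
import jax.numpy as jnp


def pessimistic_td_target(reward, discount, next_q_mean, next_q_scale, c=0.0):
    """Bootstrapped target shifted by c times the predicted noise scale.

    c < 0 yields a more pessimistic target and c > 0 a more optimistic one; the
    paper reports a single default value and shows tuning it per-suite can
    further improve sample efficiency.
    """
    return reward + discount * (next_q_mean + c * next_q_scale)


# Example usage on a batch of two transitions with a slightly pessimistic c.
targets = pessimistic_td_target(
    reward=jnp.array([1.0, 0.5]),
    discount=jnp.array([0.99, 0.99]),
    next_q_mean=jnp.array([10.0, 8.0]),
    next_q_scale=jnp.array([2.0, 1.5]),
    c=-0.1,  # placeholder, not the paper's reported default
)
```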
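
The Experiment Setup row fully specifies the network shape and initialization, so a minimal sketch in the JAX/Haiku stack listed under Software Dependencies is given below. It is not the authors' released code (see the repository linked above): the critic signature, the example observation/action sizes, and the 3e-4 learning rate are illustrative assumptions, while the layer sizes, activations, and orthogonal gains follow the quoted setup.

```python
# Minimal sketch, assuming the JAX/Haiku/optax stack; not the authors' released code.
import haiku as hk
import jax
import jax.numpy as jnp
import optax


def critic_fn(obs: jnp.ndarray, action: jnp.ndarray):
    """Q-value head plus a noise-scale ("variance") head, initialized as quoted."""
    hidden_init = hk.initializers.Orthogonal(scale=2 ** 0.5)  # gain sqrt(2) for most layers
    head_init = hk.initializers.Orthogonal(scale=1.0)         # gain 1 for the variance head
    x = jnp.concatenate([obs, action], axis=-1)
    for _ in range(2):                                        # two hidden layers of size 256
        x = jax.nn.relu(hk.Linear(256, w_init=hidden_init)(x))
    q_value = hk.Linear(1, w_init=hidden_init)(x)             # Q-value output
    log_scale = hk.Linear(1, w_init=head_init)(x)             # variance head, gain 1
    return q_value, log_scale


critic = hk.without_apply_rng(hk.transform(critic_fn))
# Example observation/action sizes; the real shapes depend on the task.
params = critic.init(jax.random.PRNGKey(0), jnp.zeros((1, 17)), jnp.zeros((1, 6)))
optimizer = optax.adam(3e-4)  # Adam with default betas/eps; the learning rate is an assumption
opt_state = optimizer.init(params)
```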