Adaptive Ensemble Q-learning: Minimizing Estimation Bias via Error Feedback

Authors: Hang Wang, Sen Lin, Junshan Zhang

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments are carried out to show that AdaEQ can improve the learning performance over the existing methods on the MuJoCo benchmark.
Researcher Affiliation | Academia | Hang Wang, Arizona State University, Tempe, Arizona, USA (hwang442@asu.edu); Sen Lin, Arizona State University, Tempe, Arizona, USA (slin70@asu.edu); Junshan Zhang, Arizona State University, Tempe, Arizona, USA (Junshan.Zhang@asu.edu)
Pseudocode | Yes | Algorithm 1: Adaptive Ensemble Q-learning (AdaEQ) (an illustrative sketch of the key steps is given after this table)
Open Source Code | Yes | Our training code and training logs will be available at https://github.com/ustcmike/AdaEQ_NeurIPS21
Open Datasets | Yes | To make a fair comparison, we follow the setup of [6] and use the same code base to compare the performance of AdaEQ with REDQ [6] and Average-DQN (AVG) [2] on three MuJoCo continuous control tasks: Hopper, Ant, and Walker2d. ... MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026-5033. IEEE, 2012.
Dataset Splits | No | The paper describes running "evaluation episodes" and "testing trajectories" for performance assessment, but it does not specify traditional training/validation/test dataset splits with percentages or counts, as is common in static-dataset machine learning.
Hardware Specification | Yes | We conduct all experiments using an NVIDIA GeForce RTX 3090 GPU and an Intel(R) Core(TM) i9-10900K CPU.
Software Dependencies | Yes | We use PyTorch (version 1.8.1) for implementing the deep neural networks.
Experiment Setup | Yes | The same hyperparameters are used for all the algorithms. Specifically, we consider N = 10 Q-function approximators in total. The ensemble size M = N = 10 for AVG, while the initial M for AdaEQ is set as 4. The ensemble size for REDQ is set as M = 2, which is the fine-tuned result from [6]. For all the experiments, we set the tolerance parameter c in (10) as 0.3 and the length of the testing trajectories as H = 500. The ensemble size is updated according to (10) every 10 epochs in AdaEQ. The discount factor is 0.99. ... The actor and critic networks use 2 hidden layers with 256 units and ReLU activations. We use the Adam optimizer with a learning rate of 3e-4. The replay buffer size is 1e6 and the batch size is 256. The policy is updated every 20 gradient steps. The target networks are updated with Polyak averaging with a parameter of 0.995. The exploration noise is decayed from 1 to 0.1 over 100000 steps. (A hedged configuration sketch collecting these values appears after the table.)
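
The pseudocode row above only names Algorithm 1, so the sketch below is a minimal illustration of the two ingredients this report extracts from it: a pessimistic Bellman target computed over a random subset of M of the N target Q-networks, and an ensemble-size update driven by a Monte-Carlo bias estimate over a testing trajectory of length H. The function names, the Gymnasium-style environment interface, and the simple +/-1 step on M are assumptions made for illustration; the exact update rule is Eq. (10) in the paper and the released code.

```python
# Illustrative sketch of AdaEQ-style adaptive ensemble Q-learning (hypothetical interfaces).
# Assumptions: q networks and the policy are callables returning tensors/actions, the env
# follows the Gymnasium 5-tuple step API, and Eq. (10) is approximated by a +/-1 step on M
# whenever the estimated bias leaves the tolerance band [-c, c].
import random
import torch


def ensemble_target(q_targets, next_obs, next_act, reward, done, M, gamma=0.99):
    """Bellman target built from a random subset of M out of the N target Q-networks."""
    subset = random.sample(q_targets, M)
    q_vals = torch.stack([q(next_obs, next_act) for q in subset], dim=0)
    min_q = q_vals.min(dim=0).values  # pessimistic estimate over the subset
    return reward + gamma * (1.0 - done) * min_q


def estimate_bias(q_func, policy, env, H=500, gamma=0.99):
    """Average (Q-estimate - Monte-Carlo return) along one testing trajectory of length H."""
    obs, _ = env.reset()
    q_preds, rewards = [], []
    for _ in range(H):
        act = policy(obs)
        q_preds.append(float(q_func(obs, act)))
        obs, r, terminated, truncated, _ = env.step(act)
        rewards.append(r)
        if terminated or truncated:
            break
    # Discounted Monte-Carlo returns, computed backwards from the end of the trajectory.
    mc_returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        mc_returns.append(g)
    mc_returns.reverse()
    errors = [q - g for q, g in zip(q_preds, mc_returns)]
    return sum(errors) / max(len(errors), 1)


def update_ensemble_size(M, bias, c=0.3, N=10):
    """Stand-in for Eq. (10): grow M when overestimating, shrink when underestimating.
    The lower bound of 2 is an assumption, not taken from the paper."""
    if bias > c:
        return min(M + 1, N)
    if bias < -c:
        return max(M - 1, 2)
    return M
```

In this reading, the "error feedback" is the sign and magnitude of the Monte-Carlo bias estimate: positive bias (overestimation) pushes the ensemble toward a larger, more pessimistic min-subset, and negative bias pushes it back toward a smaller one.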
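
For convenience, the hyperparameters quoted in the Experiment Setup row can be gathered into one configuration object. The dictionary below is a hedged restatement of only the reported values; the key names and the MuJoCo task identifiers (e.g. "Hopper-v2") are assumptions, since the paper names the tasks but not the exact environment versions.

```python
# Hedged reconstruction of the reported experiment setup; values are taken verbatim
# from the Experiment Setup row, everything else (key names, env IDs) is assumed.
config = dict(
    envs=["Hopper-v2", "Ant-v2", "Walker2d-v2"],  # assumed IDs for Hopper, Ant, Walker2d
    num_q_networks=10,         # N: total Q-function approximators
    init_ensemble_size=4,      # initial M for AdaEQ (M = N = 10 for AVG, M = 2 for REDQ)
    tolerance_c=0.3,           # tolerance parameter c in Eq. (10)
    test_traj_len=500,         # H: length of the testing trajectories
    ensemble_update_every=10,  # epochs between applications of Eq. (10)
    gamma=0.99,                # discount factor
    hidden_sizes=(256, 256),   # actor and critic: 2 hidden layers of 256 units, ReLU
    lr=3e-4,                   # Adam learning rate
    replay_size=int(1e6),
    batch_size=256,
    policy_update_every=20,    # gradient steps between policy updates
    polyak=0.995,              # target-network averaging coefficient
    expl_noise_start=1.0,      # exploration noise decayed from 1.0 ...
    expl_noise_end=0.1,        # ... down to 0.1
    expl_noise_decay_steps=100_000,
)
```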