Adaptive Ensemble Q-learning: Minimizing Estimation Bias via Error Feedback
Authors: Hang Wang, Sen Lin, Junshan Zhang
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments are carried out to show that AdaEQ can improve the learning performance over the existing methods on the MuJoCo benchmark. |
| Researcher Affiliation | Academia | Hang Wang (Arizona State University, Tempe, Arizona, USA; hwang442@asu.edu); Sen Lin (Arizona State University, Tempe, Arizona, USA; slin70@asu.edu); Junshan Zhang (Arizona State University, Tempe, Arizona, USA; Junshan.Zhang@asu.edu) |
| Pseudocode | Yes (a sketch of the adaptation step appears below the table) | Algorithm 1 Adaptive Ensemble Q-learning (AdaEQ) |
| Open Source Code | Yes | Our training code and training logs will be available at https://github.com/ustcmike/AdaEQ_NeurIPS21 |
| Open Datasets | Yes | To make a fair comparison, we follow the setup of [6] and use the same code base to compare the performance of AdaEQ with REDQ [6] and Average-DQN (AVG) [2], on three MuJoCo continuous control tasks: Hopper, Ant and Walker2d. ... MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026-5033. IEEE, 2012. |
| Dataset Splits | No | The paper describes running "evaluation episodes" and "testing trajectories" for performance assessment, but it does not specify traditional training/validation/test dataset splits with percentages or counts, as is common in machine learning on static datasets. |
| Hardware Specification | Yes | We conduct all experiments using an NVIDIA GeForce RTX 3090 GPU and an Intel(R) Core(TM) i9-10900K CPU. |
| Software Dependencies | Yes | We use PyTorch (version 1.8.1) for implementing the deep neural networks. |
| Experiment Setup | Yes (see the configuration sketch below the table) | The same hyperparameters are used for all the algorithms. Specifically, we consider N = 10 Q-function approximators in total. The ensemble size M = N = 10 for AVG, while the initial M for AdaEQ is set as 4. The ensemble size for REDQ is set as M = 2, which is the fine-tuned result from [6]. For all the experiments, we set the tolerance parameter c in (10) as 0.3 and the length of the testing trajectories as H = 500. The ensemble size is updated according to (10) every 10 epochs in AdaEQ. The discount factor is 0.99. ... The actor and critic networks use 2 hidden layers with 256 units and ReLU activations. We use the Adam optimizer with a learning rate of 3e-4. The replay buffer size is 1e6 and the batch size is 256. The policy is updated every 20 gradient steps. The target networks are updated with Polyak averaging with a parameter of 0.995. The exploration noise is decayed from 1 to 0.1 over 100000 steps. |
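For reference, the hyperparameters quoted in the Experiment Setup row can be collected into a single configuration. This is a minimal sketch; the key names are illustrative and not taken from the authors' code.

```python
# Hyperparameters quoted in the Experiment Setup row above.
# Key names are illustrative; the authors' code may organize them differently.
config = dict(
    n_q_networks=10,            # N: total number of Q-function approximators
    init_ensemble_size=4,       # initial M for AdaEQ (M = 10 for AVG, M = 2 for REDQ)
    tolerance_c=0.3,            # tolerance parameter c in Eq. (10)
    test_traj_len=500,          # H: length of the testing trajectories
    m_update_every_epochs=10,   # ensemble size updated via Eq. (10) every 10 epochs
    gamma=0.99,                 # discount factor
    hidden_sizes=(256, 256),    # actor/critic hidden layers, ReLU activations
    lr=3e-4,                    # Adam learning rate
    replay_size=int(1e6),       # replay buffer size
    batch_size=256,
    policy_delay=20,            # policy updated every 20 gradient steps
    polyak=0.995,               # target-network averaging coefficient
    expl_noise_start=1.0,       # exploration noise decayed from 1 ...
    expl_noise_end=0.1,         # ... to 0.1
    expl_noise_decay_steps=100_000,
)
```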
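The table only names Algorithm 1 (AdaEQ), so the sketch below illustrates the error-feedback idea under explicit assumptions: that Eq. (10) estimates the Q-estimation bias by comparing ensemble Q-values against Monte Carlo returns along the H-step testing trajectories, and that M grows when the bias exceeds the tolerance c and shrinks when it falls below -c. The helper names and the exact bias estimator are assumptions, not the paper's definitions.

```python
import numpy as np

def estimate_bias(q_estimates, mc_returns):
    """Mean gap between ensemble Q-estimates and Monte Carlo returns along
    a testing trajectory (hypothetical stand-in for the Eq. (10) quantity)."""
    return float(np.mean(np.asarray(q_estimates) - np.asarray(mc_returns)))

def update_ensemble_size(M, bias, c=0.3, N=10, M_min=2):
    """Error-feedback update of the in-target ensemble size M.

    Taking a min over more critics drives the target lower, so positive
    (over-)estimation bias argues for a larger M and negative bias for a
    smaller one; within the tolerance band [-c, c], M is left unchanged.
    """
    if bias > c:
        return min(M + 1, N)      # overestimation: enlarge the ensemble
    if bias < -c:
        return max(M - 1, M_min)  # underestimation: shrink the ensemble
    return M
```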
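Since the comparison baselines build on REDQ's in-target minimization over a random subset of critics, the following PyTorch sketch shows how a size-M subset of the N target critics could enter the Bellman target; the critic call signature q(obs, act) is an assumption made for illustration.

```python
import random
import torch

def ensemble_target(target_critics, next_obs, next_act, reward, done,
                    M, gamma=0.99):
    """Bellman target using the min over a random size-M subset of the
    N target critics (REDQ-style in-target minimization)."""
    subset = random.sample(range(len(target_critics)), M)
    q_next = torch.min(
        torch.stack([target_critics[i](next_obs, next_act) for i in subset]),
        dim=0,
    ).values
    return reward + gamma * (1.0 - done) * q_next
```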