Controlling Overestimation Bias with Truncated Mixture of Continuous Distributional Quantile Critics

Authors: Arsenii Kuznetsov, Pavel Shvechikov, Alexander Grishin, Dmitry Vetrov

ICML 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "TQC outperforms the current state of the art on all environments from the continuous control benchmark suite, demonstrating 25% improvement on the most challenging Humanoid environment. We advance the state of the art on the standard continuous control benchmark suite (Section 4) and perform extensive ablation study (Section 5)."
Researcher Affiliation | Collaboration | (1) Samsung AI Center, Moscow, Russia; (2) National Research University Higher School of Economics, Moscow, Russia; (3) Samsung-HSE Laboratory, National Research University Higher School of Economics, Moscow, Russia.
Pseudocode | Yes | Algorithm 1 (TQC), reconstructed below; ∇̂ denotes the stochastic gradient.

Algorithm 1 TQC
  Initialize policy π_φ and critics Z_{ψ_n}, Z_{ψ̄_n} for n ∈ [1..N]
  Set replay buffer D = ∅, target entropy H_T = −dim A, α = 1, β = 0.005
  for each iteration do
      for each environment step, until done do
          collect transition (s_t, a_t, r_t, s_{t+1}) with policy π_φ
          D ← D ∪ {(s_t, a_t, r_t, s_{t+1})}
      end for
      for each gradient step do
          sample a batch from the replay buffer D
          α ← α − λ_α ∇̂_α J(α)                            (Eq. 5)
          φ ← φ − λ_π ∇̂_φ J_π(φ)                           (Eq. 15)
          ψ_n ← ψ_n − λ_Z ∇̂_{ψ_n} J_Z(ψ_n), n ∈ [1..N]     (Eq. 13)
          ψ̄_n ← β ψ_n + (1 − β) ψ̄_n, n ∈ [1..N]
      end for
  end for
  return policy π_φ and critics Z_{ψ_n}, n ∈ [1..N]
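The central computation behind Algorithm 1's critic targets is truncation: the atoms predicted by all N target critics are pooled, sorted, and the d · N largest are discarded before forming the distributional Bellman target. A minimal NumPy sketch of that step follows; the function name and shapes are illustrative, not taken from the released code.

```python
import numpy as np

def truncate_pooled_atoms(target_atoms, d):
    """Pool the atoms of all critics, sort them, and drop the d * N largest.

    target_atoms: array of shape (N, M), the M atoms predicted by each of
        the N target critics for one (next_state, next_action) pair.
    d: number of atoms dropped per critic, so d * N are dropped in total.

    Returns the k = N*M - d*N smallest atoms; scaled by the discount and
    shifted by the reward, they form the truncated target distribution.
    """
    n, m = target_atoms.shape
    pooled = np.sort(target_atoms.reshape(-1))  # ascending order
    k = n * m - d * n
    return pooled[:k]

# Example with the paper's settings: N = 5 critics, M = 25 atoms, d = 2.
rng = np.random.default_rng(0)
atoms = rng.normal(size=(5, 25))
print(truncate_pooled_atoms(atoms, d=2).shape)  # (115,): 125 pooled - 10 dropped
```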
Open Source Code | Yes | "To facilitate reproducibility, we carefully document the experimental setup, perform exhaustive ablation, average experimental results over a large number of seeds, publish raw data of seed runs, and release the code for TensorFlow and PyTorch." Code: https://github.com/bayesgroup/tqc (TensorFlow) and https://github.com/bayesgroup/tqc_pytorch (PyTorch).
Open Datasets | Yes | "Second, we quantitatively compare our method with competitors on a standard continuous control benchmark: the set of MuJoCo (Todorov et al., 2012) environments implemented in OpenAI Gym (Brockman et al., 2016)."
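These environments are instantiated through the standard Gym API. A quick illustration (the "Humanoid-v3" id and the 4-tuple step interface correspond to older gym releases; the exact version suffix depends on the installed packages):

```python
import gym  # the MuJoCo environments also require the mujoco bindings

env = gym.make("Humanoid-v3")
state = env.reset()
action = env.action_space.sample()  # random action, for illustration only
next_state, reward, done, info = env.step(action)
env.close()
```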
Dataset Splits | No | The paper uses continuous control environments (MuJoCo, OpenAI Gym) for reinforcement learning. While it describes training and evaluation procedures, it does not provide specific train/validation/test dataset splits as typically defined for static datasets.
Hardware Specification | Yes | "Table 4. Time measurements (in seconds) of a single training epoch (1000 frames), averaged over 1000 epochs, executed on the Tesla P40 GPU."
Software Dependencies | No | The paper mentions TensorFlow and PyTorch as frameworks for which code is released, but it does not provide specific version numbers for these or any other software dependencies.
Experiment Setup | Yes | "For all MuJoCo experiments, we use N = 5 critic networks with three hidden layers of 512 neurons each, M = 25 atoms, and the best number of dropped atoms per network d ∈ [0..5], if not stated otherwise. The other hyperparameters are the same as in SAC (see Appendix B). In this experiment we vary the number of atoms (per network) to drop in the range d ∈ [0..5]. The total number of atoms dropped is d · N. We fix the number of atoms for each Q-network to M = 25."
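The quoted architecture (three hidden layers of 512 units, M = 25 output atoms, N = 5 critics) maps directly onto a small network definition. A hypothetical PyTorch sketch, not taken from the released repositories; the state and action dimensions (17 and 6, as in HalfCheetah) are placeholders:

```python
import torch
import torch.nn as nn

class QuantileCritic(nn.Module):
    """One of the N critics: maps (state, action) to M atoms (quantile
    locations) of the return distribution, per the setup quoted above."""

    def __init__(self, state_dim, action_dim, n_atoms=25, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_atoms),
        )

    def forward(self, state, action):
        # Concatenate state and action, return a (batch, M) atom tensor.
        return self.net(torch.cat([state, action], dim=-1))

# An ensemble of N = 5 critics, as in the MuJoCo experiments.
critics = nn.ModuleList([QuantileCritic(17, 6) for _ in range(5)])
```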