Controlling Overestimation Bias with Truncated Mixture of Continuous Distributional Quantile Critics

Authors: Arsenii Kuznetsov, Pavel Shvechikov, Alexander Grishin, Dmitry Vetrov

ICML 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "TQC outperforms the current state of the art on all environments from the continuous control benchmark suite, demonstrating 25% improvement on the most challenging Humanoid environment. We advance the state of the art on the standard continuous control benchmark suite (Section 4) and perform extensive ablation study (Section 5)."
Researcher Affiliation | Collaboration | (1) Samsung AI Center, Moscow, Russia; (2) National Research University Higher School of Economics, Moscow, Russia; (3) Samsung-HSE Laboratory, National Research University Higher School of Economics, Moscow, Russia.
Pseudocode | Yes | Algorithm 1 (TQC), reconstructed below; ∇̂ denotes the stochastic gradient.

Algorithm 1 TQC
  Initialize policy π_φ and critics Z_{ψ_n}, Z_{ψ̄_n} for n ∈ [1..N]
  Set replay buffer D = ∅, target entropy H_T = −dim A, α = 1, β = 0.005
  for each iteration do
      for each environment step, until done do
          collect transition (s_t, a_t, r_t, s_{t+1}) with policy π_φ
          D ← D ∪ {(s_t, a_t, r_t, s_{t+1})}
      end for
      for each gradient step do
          sample a batch from the replay buffer D
          α ← α − λ_α ∇̂_α J(α)                            (Eq. 5)
          φ ← φ − λ_π ∇̂_φ J_π(φ)                           (Eq. 15)
          ψ_n ← ψ_n − λ_Z ∇̂_{ψ_n} J_Z(ψ_n), n ∈ [1..N]     (Eq. 13)
          ψ̄_n ← β ψ_n + (1 − β) ψ̄_n, n ∈ [1..N]
      end for
  end for
  return policy π_φ and critics Z_{ψ_n}, n ∈ [1..N]
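The central computation behind Algorithm 1's critic targets is truncation: the atoms predicted by all N target critics are pooled, sorted, and the d · N largest are discarded before forming the distributional Bellman target. A minimal NumPy sketch of that step follows; the function name and shapes are illustrative, not taken from the released code.

```python
import numpy as np

def truncate_pooled_atoms(target_atoms, d):
    """Pool the atoms of all critics, sort them, and drop the d * N largest.

    target_atoms: array of shape (N, M), the M atoms predicted by each of
        the N target critics for one (next_state, next_action) pair.
    d: number of atoms dropped per critic, so d * N are dropped in total.

    Returns the k = N*M - d*N smallest atoms; scaled by the discount and
    shifted by the reward, they form the truncated target distribution.
    """
    n, m = target_atoms.shape
    pooled = np.sort(target_atoms.reshape(-1))  # ascending order
    k = n * m - d * n
    return pooled[:k]

# Example with the paper's settings: N = 5 critics, M = 25 atoms, d = 2.
rng = np.random.default_rng(0)
atoms = rng.normal(size=(5, 25))
print(truncate_pooled_atoms(atoms, d=2).shape)  # (115,): 125 pooled - 10 dropped
```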
Open Source Code | Yes | "To facilitate reproducibility, we carefully document the experimental setup, perform exhaustive ablation, average experimental results over a large number of seeds, publish raw data of seed runs, and release the code for TensorFlow and PyTorch." Code: https://github.com/bayesgroup/tqc (TensorFlow) and https://github.com/bayesgroup/tqc_pytorch (PyTorch).
Open Datasets | Yes | "Second, we quantitatively compare our method with competitors on a standard continuous control benchmark: the set of MuJoCo (Todorov et al., 2012) environments implemented in OpenAI Gym (Brockman et al., 2016)."
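These environments are instantiated through the standard Gym API. A quick illustration (the "Humanoid-v3" id and the 4-tuple step interface correspond to older gym releases; the exact version suffix depends on the installed packages):

```python
import gym  # the MuJoCo environments also require the mujoco bindings

env = gym.make("Humanoid-v3")
state = env.reset()
action = env.action_space.sample()  # random action, for illustration only
next_state, reward, done, info = env.step(action)
env.close()
```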
Dataset Splits | No | The paper uses continuous control environments (MuJoCo, OpenAI Gym) for reinforcement learning. While it describes training and evaluation procedures, it does not provide specific train/validation/test dataset splits as typically defined for static datasets.
Hardware Specification | Yes | "Table 4. Time measurements (in seconds) of a single training epoch (1000 frames), averaged over 1000 epochs, executed on the Tesla P40 GPU."
Software Dependencies | No | The paper mentions TensorFlow and PyTorch as frameworks for which code is released, but it does not provide specific version numbers for these or any other software dependencies.
Experiment Setup | Yes | "For all MuJoCo experiments, we use N = 5 critic networks with three hidden layers of 512 neurons each, M = 25 atoms, and the best number of dropped atoms per network d ∈ [0..5], if not stated otherwise. The other hyperparameters are the same as in SAC (see Appendix B). In this experiment we vary the number of atoms (per network) to drop in the range d ∈ [0..5]. The total number of atoms dropped is d · N. We fix the number of atoms for each Q-network to M = 25."
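The quoted architecture (three hidden layers of 512 units, M = 25 output atoms, N = 5 critics) maps directly onto a small network definition. A hypothetical PyTorch sketch, not taken from the released repositories; the state and action dimensions (17 and 6, as in HalfCheetah) are placeholders:

```python
import torch
import torch.nn as nn

class QuantileCritic(nn.Module):
    """One of the N critics: maps (state, action) to M atoms (quantile
    locations) of the return distribution, per the setup quoted above."""

    def __init__(self, state_dim, action_dim, n_atoms=25, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_atoms),
        )

    def forward(self, state, action):
        # Concatenate state and action, return a (batch, M) atom tensor.
        return self.net(torch.cat([state, action], dim=-1))

# An ensemble of N = 5 critics, as in the MuJoCo experiments.
critics = nn.ModuleList([QuantileCritic(17, 6) for _ in range(5)])
```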