Controlling Overestimation Bias with Truncated Mixture of Continuous Distributional Quantile Critics
Authors: Arsenii Kuznetsov, Pavel Shvechikov, Alexander Grishin, Dmitry Vetrov
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | TQC outperforms the current state of the art on all environments from the continuous control benchmark suite, demonstrating 25% improvement on the most challenging Humanoid environment. We advance the state of the art on the standard continuous control benchmark suite (Section 4) and perform extensive ablation study (Section 5). |
| Researcher Affiliation | Collaboration | 1Samsung AI center, Moscow, Russia 2National Research University Higher School of Economics, Moscow, Russia 3Samsung HSE Laboratory, National Research University Higher School of Economics, Moscow, Russia. |
| Pseudocode | Yes | Algorithm 1 TQC. ∇̂ denotes the stochastic gradient. Initialize policy πφ, critics Zψn, target critics Zψ̄n for n ∈ [1..N]. Set replay D = ∅, H_T = −dim A, α = 1, β = 0.005. For each iteration: for each environment step, until done: collect transition (st, at, rt, st+1) with policy πφ; D ← D ∪ {(st, at, rt, st+1)}. For each gradient step: sample a batch from the replay D; α ← α − λα ∇̂α J(α) (Eq. 5); φ ← φ − λπ ∇̂φ Jπ(φ) (Eq. 15); ψn ← ψn − λZ ∇̂ψn JZ(ψn), n ∈ [1..N] (Eq. 13); ψ̄n ← β ψn + (1 − β) ψ̄n, n ∈ [1..N]. Return policy πφ, critics Zψn, n ∈ [1..N]. |
| Open Source Code | Yes | To facilitate reproducibility, we carefully document the experimental setup, perform exhaustive ablation, average experimental results over a large number of seeds, publish raw data of seed runs, and release the code for TensorFlow1 and PyTorch2. 1https://github.com/bayesgroup/tqc 2https://github.com/bayesgroup/tqc_pytorch |
| Open Datasets | Yes | Second, we quantitatively compare our method with competitors on a standard continuous control benchmark: the set of MuJoCo (Todorov et al., 2012) environments implemented in OpenAI Gym (Brockman et al., 2016). |
| Dataset Splits | No | The paper uses continuous control environments (MuJoCo, OpenAI Gym) for reinforcement learning. While it describes training and evaluation procedures, it does not provide specific train/validation/test dataset splits as typically defined for static datasets. |
| Hardware Specification | Yes | Table 4. Time measurements (in seconds) of a single training epoch (1000 frames), averaged over 1000 epochs, executed on the Tesla P40 GPU. |
| Software Dependencies | No | The paper mentions 'TensorFlow' and 'PyTorch' as frameworks for which code is released, but it does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | For all MuJoCo experiments, we use N = 5 critic networks with three hidden layers of 512 neurons each, M = 25 atoms, and the best number of dropped atoms per network d ∈ [0..5], if not stated otherwise. The other hyperparameters are the same as in SAC (see Appendix B). In this experiment we vary the number of atoms (per network) to drop in the range d ∈ [0..5]. The total number of atoms dropped is d·N. We fix the number of atoms for each Q-network to M = 25. |
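The truncation step at the heart of TQC, summarized in the pseudocode and setup rows above, pools the M atoms from each of the N critics, sorts them, and discards the d·N largest before forming the target. A minimal sketch of that operation (function name and shapes are illustrative, not from the released code):

```python
import numpy as np

def truncated_target_atoms(critic_atoms, d):
    """Pool the atoms of all N critics, sort ascending, and drop the
    d largest atoms per critic (d * N in total), as in TQC's
    truncated-mixture target. `critic_atoms` is a list of N arrays
    of M atoms each; `d` is the number of atoms dropped per critic."""
    pooled = np.sort(np.concatenate(critic_atoms))  # N * M atoms, ascending
    n_critics = len(critic_atoms)
    keep = pooled.size - d * n_critics
    return pooled[:keep]

# Toy example: N = 2 critics, M = 4 atoms each, drop d = 1 atom per critic.
atoms = [np.array([0.1, 0.5, 0.9, 1.3]),
         np.array([0.2, 0.6, 1.0, 1.4])]
kept = truncated_target_atoms(atoms, d=1)  # the 2 largest atoms are removed
```

With the paper's settings (N = 5, M = 25, d ∈ [0..5]), the pooled set has 125 atoms and up to 25 are dropped, which is how the method trades off over- versus underestimation.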