A Self-Tuning Actor-Critic Algorithm
Authors: Tom Zahavy, Zhongwen Xu, Vivek Veeriah, Matteo Hessel, Junhyuk Oh, Hado P. van Hasselt, David Silver, Satinder Singh
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | When applied to the Arcade Learning Environment (Bellemare et al., 2013), STAC improved the median human normalized score in 200M steps from 243% to 364%. When applied to the DM Control suite (Tassa et al., 2018), STAC improved the mean score in 30M steps from 217 to 389 when learning with features, from 108 to 202 when learning from pixels, and from 195 to 295 in the Real-World Reinforcement Learning Challenge (Dulac-Arnold et al., 2020). |
| Researcher Affiliation | Industry | Tom Zahavy, Zhongwen Xu, Vivek Veeriah, Matteo Hessel, Junhyuk Oh, Hado van Hasselt, David Silver and Satinder Singh, DeepMind {tomzahavy,zhongwen,vveeriah,mtthss,junhyuk,hado,davidsilver,baveja}@google.com |
| Pseudocode | Yes | The exact details can be found in the supplementary (Algorithm 2, line 11). |
| Open Source Code | No | The paper cites various third-party libraries and frameworks (e.g., JAX, RLax, Haiku, Optax) with their respective URLs, but does not provide a direct link or explicit statement for the open-sourcing of the STAC/STACX implementation described in the paper. |
| Open Datasets | Yes | When applied to the Arcade Learning Environment (Bellemare et al., 2013, ALE)... When applied to the DM Control suite (Tassa et al., 2018)... |
| Dataset Splits | No | The paper discusses evaluation using median human normalized scores after a certain number of frames and averaging over seeds, which is typical for RL environments, but does not explicitly provide dataset split percentages (e.g., train/validation/test splits) in the traditional supervised learning sense for reproducibility. |
| Hardware Specification | No | The paper states 'does not require a significant increase in compute (see Table 4 in the supplementary and the discussion that follows it)', implying details might be in the supplementary material. However, the provided main paper text does not specify any particular hardware (e.g., CPU/GPU models, RAM, or specific TPU versions) used for the experiments. |
| Software Dependencies | No | The paper mentions software like 'JAX (Bradbury et al., 2018)', 'RLax (Budden et al., 2020)', 'Haiku (Hennigan et al., 2020)', and 'Optax (Hessel et al., 2020)' with publication years, but does not provide specific version numbers for these software dependencies (e.g., PyTorch 1.9 or JAX 0.2.1). |
| Experiment Setup | Yes | For the outer loss hyperparameters, we use exactly the same hyperparameters that were used in the IMPALA paper for all of our agents (g_v^outer = 0.25, g_p^outer = 1, g_e^outer = 1, λ^outer = 1), with one exception: we use γ = 0.995... For the initializations of the metaparameters we use the corresponding parameters in the outer loss, i.e., for any metaparameter η_i, we set η_i^init = 4.6 such that σ(η_i^init) = 0.99... For the meta optimizer, we use ADAM with default settings (e.g., the learning rate is set to 10^-3), and for the KL coefficient, we use g_kl^outer = 1. |
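The metaparameter initialization quoted in the Experiment Setup row can be checked numerically: squashing an unconstrained value of 4.6 through a sigmoid yields roughly 0.99, so initializing every η_i at 4.6 starts the self-tuned coefficients near their outer-loss values. This is a minimal sketch of that arithmetic; the function names are illustrative, not identifiers from the paper's code.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic squashing that keeps a metaparameter in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def inverse_sigmoid(p: float) -> float:
    """Logit: the unconstrained value whose sigmoid equals p."""
    return math.log(p / (1.0 - p))

# The quoted setup sets eta_i^init = 4.6 so that sigmoid(eta_i^init) ~= 0.99.
eta_init = 4.6
print(round(sigmoid(eta_init), 2))       # -> 0.99
print(round(inverse_sigmoid(0.99), 2))   # -> 4.6
```

The two functions are inverses, which is why picking the target squashed value (0.99) immediately determines the reported initialization (4.6).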
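The ALE results in the Research Type row are reported as median human normalized scores. The normalization itself is not spelled out in this excerpt, but the convention in the deep RL literature scales each game's score by the gap between a random and a human baseline. A minimal sketch under that assumption; all numbers below are illustrative placeholders, not results from the paper.

```python
import statistics

def human_normalized(score: float, random_score: float, human_score: float) -> float:
    """Score as a fraction of the human-vs-random gap (1.0 == human level)."""
    return (score - random_score) / (human_score - random_score)

# Illustrative per-game baselines and agent scores only.
per_game = [
    human_normalized(1200.0, 100.0, 1000.0),  # above human on this game
    human_normalized(450.0, 50.0, 850.0),     # half the human gap
    human_normalized(3000.0, 0.0, 1500.0),    # far above human
]
median = statistics.median(per_game)
print(f"median human-normalized score: {median:.0%}")  # -> 122%
```

Using the median (rather than the mean) across games keeps a single high-scoring outlier game from dominating the aggregate, which is why it is the standard headline metric on ALE.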