Information asymmetry in KL-regularized RL
Authors: Alexandre Galashov, Siddhant M. Jayakumar, Leonard Hasenclever, Dhruva Tirumala, Jonathan Schwarz, Guillaume Desjardins, Wojciech M. Czarnecki, Yee Whye Teh, Razvan Pascanu, Nicolas Heess
ICLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present empirical results in both discrete and continuous action domains and demonstrate that, for certain tasks, learning a default policy alongside the policy can significantly speed up and improve learning. |
| Researcher Affiliation | Industry | Alexandre Galashov, Siddhant M. Jayakumar, Leonard Hasenclever, Dhruva Tirumala, Jonathan Schwarz, Guillaume Desjardins, Wojciech M. Czarnecki, Yee Whye Teh, Razvan Pascanu, Nicolas Heess. DeepMind, London, UK. {agalashov,sidmj,leonardh,dhruvat,schwarzjn,gdesjardins,lejlot,ywteh,razp,heess}@google.com |
| Pseudocode | Yes | In Algorithm 1 we provide pseudo-code for an actor-critic version of the algorithm with K-step returns, and Algorithm 2 is an off-policy version of Algorithm 1 with a Retrace-estimated Q-function. |
| Open Source Code | No | The paper does not contain an explicit statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | No | The paper uses the DMLab-30 suite of environments, citing Beattie et al. (2016); that citation describes the environment itself rather than providing concrete access information (link, DOI, specific repository, or dataset citation) for a fixed dataset used in training. The continuous control experiments likewise use simulated environments rather than external datasets. |
| Dataset Splits | No | The paper does not provide specific percentages or sample counts for training, validation, or test dataset splits, nor does it reference predefined splits with explicit citations for data partitioning. |
| Hardware Specification | No | The paper mentions a distributed actor-learner architecture with a varying number of actors but does not specify any particular hardware details such as GPU models, CPU models, or memory specifications used for the experiments. |
| Software Dependencies | No | The paper mentions various algorithms and models used (e.g., SVG(0), V-trace, ResNet, LSTM) but does not provide specific version numbers for any software dependencies or libraries required for replication. |
| Experiment Setup | Yes | The paper provides detailed hyperparameter settings in Appendix D.2, including actor and critic learning rates, network sizes, batch size, unroll length, entropy bonus, and regularization constants for various tasks. |
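
The Pseudocode row above refers to a KL-regularized actor-critic update with K-step returns, in which the reward is augmented by a penalty for diverging from a learned default policy that receives restricted observations. Below is a minimal sketch of that regularized K-step return for a discrete action space; the function names, the fixed coefficient `alpha`, and the categorical KL computation are illustrative assumptions, not code from the paper.

```python
import numpy as np

def categorical_kl(pi, pi0, eps=1e-8):
    """KL(pi || pi_0) between two categorical action distributions.
    pi:  agent policy probabilities,   shape [num_actions]
    pi0: default policy probabilities, shape [num_actions]
    """
    return float(np.sum(pi * (np.log(pi + eps) - np.log(pi0 + eps))))

def kl_regularized_k_step_return(rewards, kls, bootstrap_value,
                                 gamma=0.99, alpha=0.1):
    """K-step return in which each reward r_t is replaced by
    r_t - alpha * KL(pi(.|x_t) || pi_0(.|x_t^D)), following the
    KL-regularized objective; alpha here is a hypothetical fixed coefficient."""
    ret = bootstrap_value
    for r, kl in zip(reversed(rewards), reversed(kls)):
        ret = (r - alpha * kl) + gamma * ret
    return ret

# Toy usage: a 3-step rollout with a 2-action policy.
pi = np.array([0.7, 0.3])    # agent policy conditioned on the full observation
pi0 = np.array([0.5, 0.5])   # default policy conditioned on a restricted observation
kls = [categorical_kl(pi, pi0)] * 3
print(kl_regularized_k_step_return(rewards=[1.0, 0.0, 1.0],
                                   kls=kls, bootstrap_value=0.5))
```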