Information asymmetry in KL-regularized RL

Authors: Alexandre Galashov, Siddhant M. Jayakumar, Leonard Hasenclever, Dhruva Tirumala, Jonathan Schwarz, Guillaume Desjardins, Wojciech M. Czarnecki, Yee Whye Teh, Razvan Pascanu, Nicolas Heess

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present empirical results in both discrete and continuous action domains and demonstrate that, for certain tasks, learning a default policy alongside the policy can significantly speed up and improve learning.
Researcher Affiliation | Industry | Alexandre Galashov, Siddhant M. Jayakumar, Leonard Hasenclever, Dhruva Tirumala, Jonathan Schwarz, Guillaume Desjardins, Wojciech M. Czarnecki, Yee Whye Teh, Razvan Pascanu, Nicolas Heess. DeepMind, London, UK. {agalashov,sidmj,leonardh,dhruvat,schwarzjn,gdesjardins,lejlot,ywteh,razp,heess}@google.com
Pseudocode | Yes | In Algorithm 1 we provide pseudo-code for an actor-critic version of the algorithm with K-step returns, and Algorithm 2 is an off-policy version of Algorithm 1 that uses a Retrace-estimated Q-function. (An illustrative sketch of such a KL-regularized actor-critic update follows the table.)
Open Source Code | No | The paper does not contain an explicit statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets | No | The paper refers to the DMLab-30 set of environments, citing Beattie et al. (2016), which describes the environment itself rather than providing concrete access information (link, DOI, specific repository, or citation to a downloadable dataset) for a fixed dataset used in training. The continuous control experiments also use simulated environments, not external datasets.
Dataset Splits | No | The paper does not provide specific percentages or sample counts for training, validation, or test dataset splits, nor does it reference predefined splits with explicit citations for data partitioning.
Hardware Specification | No | The paper mentions a distributed actor-learner architecture with a varying number of actors but does not specify any particular hardware details such as GPU models, CPU models, or memory specifications used for the experiments.
Software Dependencies | No | The paper mentions various algorithms and models used (e.g., SVG(0), V-trace, ResNet, LSTM) but does not provide specific version numbers for any software dependencies or libraries required for replication.
Experiment Setup | Yes | The paper provides detailed hyperparameter settings in Appendix D.2, including actor and critic learning rates, network sizes, batch size, unroll length, entropy bonus, and regularization constants for various tasks.
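
The recipe summarized in the rows above is KL-regularized actor-critic learning with K-step returns, where the agent policy is regularized toward a default policy that is learned alongside it but deliberately sees less information. The sketch below is a minimal illustration of that idea on a toy goal-reaching chain, not the authors' implementation: the environment, the linear-softmax parameterization, and all hyperparameter values (alpha, gamma, k_steps, the learning rates) are assumptions made only for this example.

# Minimal sketch (not the paper's code) of KL-regularized actor-critic with
# information asymmetry: the agent policy sees position and goal, the default
# policy sees only the position. All settings below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
N = 5                        # positions on a 1-D chain (assumed toy environment)
ACTIONS = (-1, +1)           # step left / step right
alpha = 0.1                  # KL regularization strength (assumed value)
gamma = 0.95                 # discount factor (assumed value)
k_steps = 5                  # length of the K-step return (assumed value)
lr_pi, lr_v, lr_pi0 = 0.1, 0.1, 0.1   # learning rates (assumed values)

def one_hot(i, n):
    v = np.zeros(n)
    v[i] = 1.0
    return v

def full_obs(pos, goal):      # the agent policy sees position and goal
    return np.concatenate([one_hot(pos, N), one_hot(goal, N)])

def partial_obs(pos, goal):   # the default policy sees only the position
    return one_hot(pos, N)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

theta_pi = np.zeros((2 * N, len(ACTIONS)))    # agent policy (linear-softmax)
theta_pi0 = np.zeros((N, len(ACTIONS)))       # default policy (linear-softmax)
w_v = np.zeros(2 * N)                         # linear value function

def step_env(pos, goal, a):
    pos = int(np.clip(pos + ACTIONS[a], 0, N - 1))
    return pos, float(pos == goal)            # reward 1 while sitting on the goal

for episode in range(2000):
    goal, pos = rng.integers(N), rng.integers(N)
    for _ in range(3):                        # a few K-step segments per episode
        obs, obs0, acts, rews, kls = [], [], [], [], []
        for t in range(k_steps):
            x, x0 = full_obs(pos, goal), partial_obs(pos, goal)
            pi = softmax(x @ theta_pi)
            pi0 = softmax(x0 @ theta_pi0)
            a = rng.choice(len(ACTIONS), p=pi)
            pos, r = step_env(pos, goal, a)
            # KL(pi || pi0): cost of deviating from the less-informed default policy
            kls.append(float(np.sum(pi * (np.log(pi + 1e-8) - np.log(pi0 + 1e-8)))))
            obs.append(x); obs0.append(x0); acts.append(a); rews.append(r)
        G = full_obs(pos, goal) @ w_v         # bootstrap with the value of the last state
        for t in reversed(range(k_steps)):
            # K-step return on the KL-augmented reward r_t - alpha * KL_t
            G = (rews[t] - alpha * kls[t]) + gamma * G
            x, a = obs[t], acts[t]
            pi = softmax(x @ theta_pi)
            adv = G - x @ w_v
            # actor: policy-gradient step on the regularized return
            grad_logp = -np.outer(x, pi)
            grad_logp[:, a] += x
            theta_pi += lr_pi * adv * grad_logp
            # critic: move V(x) toward the regularized K-step return
            w_v += lr_v * (G - x @ w_v) * x
            # default policy: distill the agent policy, but from partial information only
            pi0 = softmax(obs0[t] @ theta_pi0)
            theta_pi0 += lr_pi0 * np.outer(obs0[t], pi - pi0)

In this sketch the default policy is trained by distillation from the agent policy while being denied the goal, which is the information asymmetry the title refers to: the default policy can only capture behavior that is useful regardless of the goal, and the KL term then biases the agent toward that shared default behavior.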