Discovering Reinforcement Learning Algorithms

Authors: Junhyuk Oh, Matteo Hessel, Wojciech M. Czarnecki, Zhongwen Xu, Hado P. van Hasselt, Satinder Singh, David Silver

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical results show that our method discovers its own alternative to the concept of value functions. Furthermore, it discovers a bootstrapping mechanism to maintain and use its predictions.
Researcher Affiliation | Industry | Corresponding author: junhyuk@google.com
Pseudocode | Yes | Algorithm 1: Meta-Training of Learned Policy Gradient
Open Source Code | No | The paper does not contain an explicit statement about releasing source code or a link to a code repository for the described methodology.
Open Datasets | No | For meta-training of LPG, we introduce three different kinds of toy domains as illustrated in Figure 2. Tabular grid worlds are grid worlds with fixed object locations. Random grid worlds have randomised object locations for each episode. Delayed chain MDPs are simple MDPs with delayed rewards.
Dataset Splits | No | The paper describes 'Training Environments' and 'Atari games' for meta-training and meta-testing respectively, but it does not specify explicit training, validation, and test dataset splits with percentages or counts for reproduction.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies | No | The algorithm is implemented using JAX [5].
Experiment Setup | Yes | We used a 30-dimensional prediction vector y ∈ [0, 1]^30. During meta-training, we updated the agent parameters after every 20 time-steps.