Discovering Reinforcement Learning Algorithms
Authors: Junhyuk Oh, Matteo Hessel, Wojciech M. Czarnecki, Zhongwen Xu, Hado P. van Hasselt, Satinder Singh, David Silver
NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results show that our method discovers its own alternative to the concept of value functions. Furthermore, it discovers a bootstrapping mechanism to maintain and use its predictions. |
| Researcher Affiliation | Industry | Corresponding author: junhyuk@google.com |
| Pseudocode | Yes | Algorithm 1 Meta-Training of Learned Policy Gradient |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code or a link to a code repository for the described methodology. |
| Open Datasets | No | For meta-training of LPG, we introduce three different kinds of toy domains as illustrated in Figure 2. Tabular grid worlds are grid worlds with fixed object locations. Random grid worlds have randomised object locations for each episode. Delayed chain MDPs are simple MDPs with delayed rewards. |
| Dataset Splits | No | The paper describes 'Training Environments' and 'Atari games' for meta-training and meta-testing respectively, but it does not specify explicit training, validation, and test dataset splits with percentages or counts for reproduction. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The algorithm is implemented using JAX [5]. |
| Experiment Setup | Yes | We used a 30-dimensional prediction vector y ∈ [0, 1]^30. During meta-training, we updated the agent parameters after every 20 time-steps. (An illustrative sketch follows the table.) |
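
The Software Dependencies and Experiment Setup rows only state that the algorithm is implemented in JAX, that the agent maintains a 30-dimensional prediction vector y ∈ [0, 1]^30, and that agent parameters are updated every 20 time-steps. The following is a minimal JAX sketch of what such an agent-side update might look like under those constraints; the network shapes, learning rate, and surrogate loss are placeholder assumptions, and the learned update rule (the LPG meta-network) that produces the actual targets is not reproduced here.

```python
import jax
import jax.numpy as jnp

# Hypothetical illustration, not the authors' code.
# Known from the paper: y lives in [0, 1]^30 and the agent parameters are
# updated from every 20-step unroll. Everything else below is assumed.
NUM_ACTIONS = 4          # placeholder action-space size
PREDICTION_DIM = 30      # y in [0, 1]^30 (from the paper)
UNROLL_LENGTH = 20       # agent update period in time-steps (from the paper)


def init_params(key, obs_dim, hidden=64):
    """Initialise a small placeholder agent network."""
    k1, k2, k3 = jax.random.split(key, 3)
    return {
        "w_hidden": jax.random.normal(k1, (obs_dim, hidden)) * 0.1,
        "w_policy": jax.random.normal(k2, (hidden, NUM_ACTIONS)) * 0.1,
        "w_pred": jax.random.normal(k3, (hidden, PREDICTION_DIM)) * 0.1,
    }


def agent_forward(params, obs):
    """Return action logits and the bounded prediction vector y."""
    h = jnp.tanh(obs @ params["w_hidden"])
    logits = h @ params["w_policy"]
    y = jax.nn.sigmoid(h @ params["w_pred"])  # keeps y inside [0, 1]^30
    return logits, y


def agent_loss(params, obs_batch, target_batch):
    # Placeholder surrogate loss: in LPG the targets for the policy and for y
    # come from the learned update rule (meta-network), which is not shown here.
    _, y = jax.vmap(agent_forward, in_axes=(None, 0))(params, obs_batch)
    return jnp.mean((y - target_batch) ** 2)


@jax.jit
def agent_update(params, obs_batch, target_batch, lr=1e-3):
    """One agent parameter update computed from a 20-step unroll."""
    grads = jax.grad(agent_loss)(params, obs_batch, target_batch)
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)
```

The sigmoid on the prediction head is one simple way to keep y inside [0, 1]^30; the paper does not specify in this excerpt how that bound is enforced, so it is an assumption of the sketch.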