Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Discovering Reinforcement Learning Algorithms
Authors: Junhyuk Oh, Matteo Hessel, Wojciech M. Czarnecki, Zhongwen Xu, Hado P. van Hasselt, Satinder Singh, David Silver
NeurIPS 2020 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results show that our method discovers its own alternative to the concept of value functions. Furthermore it discovers a bootstrapping mechanism to maintain and use its predictions. |
| Researcher Affiliation | Industry | Corresponding author: EMAIL |
| Pseudocode | Yes | Algorithm 1 Meta-Training of Learned Policy Gradient |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code or a link to a code repository for the described methodology. |
| Open Datasets | No | For meta-training of LPG, we introduce three different kinds of toy domains as illustrated in Figure 2. Tabular grid worlds are grid worlds with fixed object locations. Random grid worlds have randomised object locations for each episode. Delayed chain MDPs are simple MDPs with delayed rewards. |
| Dataset Splits | No | The paper describes 'Training Environments' and 'Atari games' for meta-training and meta-testing respectively, but it does not specify explicit training, validation, and test dataset splits with percentages or counts for reproduction. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The algorithm is implemented using JAX [5]. |
| Experiment Setup | Yes | We used a 30-dimensional prediction vector y ∈ [0, 1]^30. During meta-training, we updated the agent parameters after every 20 time-steps. |
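To make the extracted setup details concrete, here is a minimal sketch of an agent exposing the two outputs the paper describes: a policy and a 30-dimensional prediction vector y ∈ [0, 1]^30, with parameters updated every 20 time-steps. Only `Y_DIM = 30` and `UNROLL = 20` come from the paper; the observation size, action count, and the linear two-head architecture are hypothetical stand-ins (the actual implementation used JAX and is not released).

```python
import numpy as np

rng = np.random.default_rng(0)

OBS_DIM = 8        # hypothetical observation size (not specified here)
NUM_ACTIONS = 4    # hypothetical action count (not specified here)
Y_DIM = 30         # 30-dimensional prediction vector y in [0, 1]^30 (from the paper)
UNROLL = 20        # agent parameters updated every 20 time-steps (from the paper)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Hypothetical linear agent: one weight matrix per head.
params = {
    "policy": rng.normal(scale=0.1, size=(OBS_DIM, NUM_ACTIONS)),
    "y": rng.normal(scale=0.1, size=(OBS_DIM, Y_DIM)),
}

def agent(params, obs):
    """Return action probabilities and the prediction vector y in [0, 1]^30."""
    pi = softmax(obs @ params["policy"])
    y = sigmoid(obs @ params["y"])
    return pi, y

# One 20-step unroll: collect outputs, then (in LPG) the learned update
# rule would produce targets for both heads from this trajectory.
trajectory = []
for t in range(UNROLL):
    obs = rng.normal(size=OBS_DIM)
    pi, y = agent(params, obs)
    trajectory.append((pi, y))

assert len(trajectory) == UNROLL
assert trajectory[0][0].shape == (NUM_ACTIONS,)
assert trajectory[0][1].shape == (Y_DIM,)
assert (trajectory[0][1] >= 0.0).all() and (trajectory[0][1] <= 1.0).all()
```

The sigmoid on the y head is what keeps each component in [0, 1]; how those 30 numbers are used is discovered by meta-training rather than fixed by hand, which is why no hand-written target (e.g. a TD error) appears here.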