Discovering Temporally-Aware Reinforcement Learning Algorithms

Authors: Matthew Thomas Jackson, Chris Lu, Louis Kirsch, Robert Tjarko Lange, Shimon Whiteson, Jakob Nicolaus Foerster

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the learned objective functions on both in-distribution and out-of-distribution environments, ranging from continuous control tasks to discrete Atari-like settings, over a range of training horizons. They significantly improve upon the performance of their non-temporally-aware counterparts, generalizing better to previously unseen training horizons and environments.
Researcher Affiliation | Academia | Matthew T. Jackson (University of Oxford), Chris Lu (University of Oxford), Louis Kirsch (The Swiss AI Lab IDSIA), Robert T. Lange (Technical University Berlin), Shimon Whiteson (University of Oxford), Jakob N. Foerster (University of Oxford)
Pseudocode | No | The paper does not contain any sections or figures explicitly labeled 'Pseudocode' or 'Algorithm'.
Open Source Code | Yes | Our implementation of LPG, LPO, and their temporally-aware modifications (TA-LPG, TA-LPO) can be found at https://github.com/EmptyJackson/groove.
Open Datasets | Yes | In our experiments, we follow the meta-training environments originally used in Oh et al. (2020) and Lu et al. (2022a) for LPG and LPO respectively. LPG is meta-trained in a multi-task setting over a continuous distribution of Grid-World environments, with variable training horizons per task. In contrast, LPO is meta-trained on a single environment, MinAtar Space Invaders (Young & Tian, 2019; Lange, 2022), with a fixed training horizon. (An environment-instantiation sketch follows the table.)
Dataset Splits | No | The paper describes meta-training and meta-testing environments but does not provide explicit training/validation/test dataset splits with percentages or sample counts for any single dataset.
Hardware Specification | Yes | Meta-training was done on 2 A100 GPUs with synchronous updates. (A sketch of synchronous multi-device updates follows the table.)
Software Dependencies | No | The paper mentions using JAX (Bradbury et al., 2018) and evosax (Lange, 2023) but does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | Section 5.1 describes the experimental setup, and Appendices A and B (Tables 1 and 2) provide detailed hyperparameters. For example, 'Optimizer: Adam', 'Learning rate: 1e-4', 'Discount factor: 0.99', 'Policy entropy coefficient (β0): 0.05', 'ES learning rate decay: 0.999', 'MinAtar number of timesteps: 1e7', 'Brax number of timesteps: 5e7', and 'Learning rate: 3e-4' are explicitly stated. (A configuration sketch based on these values follows the table.)
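
The Open Datasets row above references MinAtar Space Invaders via the gymnax library. As a point of reference for reproduction, the snippet below is a minimal sketch (not the authors' pipeline) of instantiating that environment and rolling a single step; the registration name "SpaceInvaders-MinAtar" is taken from the standard gymnax registry and is an assumption here.

```python
# Minimal sketch: roll one step of MinAtar Space Invaders through gymnax.
import jax
import gymnax

rng = jax.random.PRNGKey(0)
env, env_params = gymnax.make("SpaceInvaders-MinAtar")  # assumed registry name

rng, key_reset, key_act, key_step = jax.random.split(rng, 4)
obs, state = env.reset(key_reset, env_params)
action = env.action_space(env_params).sample(key_act)
obs, state, reward, done, info = env.step(key_step, state, action, env_params)
```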
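
The Hardware Specification row reports meta-training on 2 A100 GPUs with synchronous updates. The sketch below is a generic JAX illustration of that pattern, not the authors' training loop: an update step is pmap-ed across local devices and gradients are averaged with lax.pmean so every replica applies the identical update. The toy loss and tensor shapes are placeholders.

```python
# Illustrative synchronous multi-device update in JAX (toy loss, not the paper's code).
import functools
import jax
import jax.numpy as jnp

LEARNING_RATE = 1e-4  # the reported meta-training learning rate

def loss_fn(params, batch):
    preds = batch["x"] @ params
    return jnp.mean((preds - batch["y"]) ** 2)

@functools.partial(jax.pmap, axis_name="devices")
def sync_update(params, batch):
    grads = jax.grad(loss_fn)(params, batch)
    # Average gradients across devices so each GPU takes the same step.
    grads = jax.lax.pmean(grads, axis_name="devices")
    return params - LEARNING_RATE * grads

n_dev = jax.local_device_count()
params = jax.device_put_replicated(jnp.zeros((4,)), jax.local_devices())
batch = {"x": jnp.ones((n_dev, 8, 4)), "y": jnp.ones((n_dev, 8))}
params = sync_update(params, batch)  # per-device parameters stay in sync
```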
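
The hyperparameters quoted in the Experiment Setup row can be gathered into a small configuration sketch. The mapping of the two learning rates onto a meta-optimizer and an inner-loop agent optimizer is an assumption made for illustration, as is the use of optax for the Adam optimizer named in the appendix.

```python
# Hedged configuration sketch built from the hyperparameters quoted above.
import optax

config = {
    "discount_factor": 0.99,           # gamma
    "policy_entropy_coef": 0.05,       # beta_0
    "es_lr_decay": 0.999,              # ES learning-rate decay
    "minatar_num_timesteps": int(1e7),
    "brax_num_timesteps": int(5e7),
}

# Which learning rate belongs to which optimizer is assumed, not confirmed above.
meta_optimizer = optax.adam(learning_rate=1e-4)   # "Learning rate 1e-4"
agent_optimizer = optax.adam(learning_rate=3e-4)  # "Learning Rate 3e-4"
```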