Discovering Temporally-Aware Reinforcement Learning Algorithms

Authors: Matthew Thomas Jackson, Chris Lu, Louis Kirsch, Robert Tjarko Lange, Shimon Whiteson, Jakob Nicolaus Foerster

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the learned objective functions on both in-distribution and out-of-distribution environments, ranging from continuous control tasks to discrete Atari-like settings, over a range of training horizons. They significantly improve upon the performance of their non-temporally-aware counterparts, generalizing better to previously unseen training horizons and environments.
Researcher Affiliation | Academia | Matthew T. Jackson (University of Oxford), Chris Lu (University of Oxford), Louis Kirsch (The Swiss AI Lab IDSIA), Robert T. Lange (Technical University Berlin), Shimon Whiteson (University of Oxford), Jakob N. Foerster (University of Oxford)
Pseudocode | No | The paper does not contain any sections or figures explicitly labeled 'Pseudocode' or 'Algorithm'.
Open Source Code | Yes | Our implementation of LPG, LPO, and their temporally-aware modifications (TA-LPG, TA-LPO) can be found at https://github.com/EmptyJackson/groove.
Open Datasets | Yes | In our experiments, we follow the meta-training environments originally used in Oh et al. (2020) and Lu et al. (2022a) for LPG and LPO respectively. LPG is meta-trained in a multi-task setting over a continuous distribution of Grid-World environments, with variable training horizons per task. In contrast, LPO is meta-trained on a single environment, MinAtar Space Invaders (Young & Tian, 2019; Lange, 2022), with a fixed training horizon. (An environment-instantiation sketch follows the table.)
Dataset Splits | No | The paper describes meta-training and meta-testing environments but does not provide explicit training/validation/test dataset splits with percentages or sample counts for any single dataset.
Hardware Specification | Yes | Meta-training was done on 2 A100 GPUs with synchronous updates. (A sketch of synchronous multi-device updates follows the table.)
Software Dependencies | No | The paper mentions using JAX (Bradbury et al., 2018) and evosax (Lange, 2023) but does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | Section 5.1 describes the experimental setup, and Appendices A and B (Tables 1 and 2) provide detailed hyperparameters. For example, 'Optimizer: Adam', 'Learning rate: 1e-4', 'Discount factor: 0.99', 'Policy entropy coefficient (β0): 0.05', 'ES learning rate decay: 0.999', 'MinAtar number of timesteps: 1e7', 'Brax number of timesteps: 5e7', and 'Learning rate: 3e-4' are explicitly stated. (A configuration sketch based on these values follows the table.)
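
The Open Datasets row above references MinAtar Space Invaders via the gymnax library. As a point of reference for reproduction, the snippet below is a minimal sketch (not the authors' pipeline) of instantiating that environment and rolling a single step; the registration name "SpaceInvaders-MinAtar" is taken from the standard gymnax registry and is an assumption here.

```python
# Minimal sketch: roll one step of MinAtar Space Invaders through gymnax.
import jax
import gymnax

rng = jax.random.PRNGKey(0)
env, env_params = gymnax.make("SpaceInvaders-MinAtar")  # assumed registry name

rng, key_reset, key_act, key_step = jax.random.split(rng, 4)
obs, state = env.reset(key_reset, env_params)
action = env.action_space(env_params).sample(key_act)
obs, state, reward, done, info = env.step(key_step, state, action, env_params)
```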
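
The Hardware Specification row reports meta-training on 2 A100 GPUs with synchronous updates. The sketch below is a generic JAX illustration of that pattern, not the authors' training loop: an update step is pmap-ed across local devices and gradients are averaged with lax.pmean so every replica applies the identical update. The toy loss and tensor shapes are placeholders.

```python
# Illustrative synchronous multi-device update in JAX (toy loss, not the paper's code).
import functools
import jax
import jax.numpy as jnp

LEARNING_RATE = 1e-4  # the reported meta-training learning rate

def loss_fn(params, batch):
    preds = batch["x"] @ params
    return jnp.mean((preds - batch["y"]) ** 2)

@functools.partial(jax.pmap, axis_name="devices")
def sync_update(params, batch):
    grads = jax.grad(loss_fn)(params, batch)
    # Average gradients across devices so each GPU takes the same step.
    grads = jax.lax.pmean(grads, axis_name="devices")
    return params - LEARNING_RATE * grads

n_dev = jax.local_device_count()
params = jax.device_put_replicated(jnp.zeros((4,)), jax.local_devices())
batch = {"x": jnp.ones((n_dev, 8, 4)), "y": jnp.ones((n_dev, 8))}
params = sync_update(params, batch)  # per-device parameters stay in sync
```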
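
The hyperparameters quoted in the Experiment Setup row can be gathered into a small configuration sketch. The mapping of the two learning rates onto a meta-optimizer and an inner-loop agent optimizer is an assumption made for illustration, as is the use of optax for the Adam optimizer named in the appendix.

```python
# Hedged configuration sketch built from the hyperparameters quoted above.
import optax

config = {
    "discount_factor": 0.99,           # gamma
    "policy_entropy_coef": 0.05,       # beta_0
    "es_lr_decay": 0.999,              # ES learning-rate decay
    "minatar_num_timesteps": int(1e7),
    "brax_num_timesteps": int(5e7),
}

# Which learning rate belongs to which optimizer is assumed, not confirmed above.
meta_optimizer = optax.adam(learning_rate=1e-4)   # "Learning rate 1e-4"
agent_optimizer = optax.adam(learning_rate=3e-4)  # "Learning Rate 3e-4"
```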