Real-Time Recurrent Learning using Trace Units in Reinforcement Learning
Authors: Esraa Elelimy, Adam White, Michael Bowling, Martha White
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we build on these insights to provide a lightweight but effective approach for training RNNs in online RL. We introduce Recurrent Trace Units (RTUs), a small modification on LRUs that we nonetheless find to have significant performance benefits over LRUs when trained with RTRL. We find RTUs significantly outperform other recurrent architectures across several partially observable environments while using significantly less computation. (A hedged sketch of the element-wise RTRL trace idea follows the table.) |
| Researcher Affiliation | Academia | Esraa Elelimy, Adam White, Michael Bowling, Martha White. University of Alberta, Alberta Machine Intelligence Institute (Amii); Canada CIFAR AI Chair. {elelimy, amw8, mbowling, whitem}@ualberta.ca |
| Pseudocode | Yes | In Appendix G, Algorithm 1, we provide the pseudocode for integrating RTRL methods with PPO, with optional steps for re-running the network to update the RTRL gradient traces, the value targets, and the advantage estimates. |
| Open Source Code | Yes | Code available at https://github.com/esraaelelimy/rtus |
| Open Datasets | Yes | We use a simple multi-step prediction task called Trace conditioning [39], inspired by experiments in animal learning. We use the standard Mujoco POMDP benchmark widely used in prior work for evaluating memory-based RL agents [31, 12, 27, 32]. We use several tasks from the POPGym benchmark [29]. (A hedged Brax environment-loading sketch follows the table.) |
| Dataset Splits | No | No. The paper mentions tuning hyperparameters by running agents with different step sizes and selecting the 'best step size value' but does not describe this as using a dedicated 'validation set' or 'validation split' of a dataset. |
| Hardware Specification | Yes | We ran Mujoco-P and Mujoco-V on an NVIDIA P100 GPU. |
| Software Dependencies | No | No. The paper mentions software like 'Adam optimizer', 'PPO [40]', 'Jax implementation of Mujoco from the Brax library [7]', but it does not provide specific version numbers for any of these software components. |
| Experiment Setup | Yes | Table 1: Hyperparameters for PPO. Buffer size: 2048; Num epochs: 10; Number of mini-batches: 32; GAE λ: 0.95; Discount factor γ: 0.99; Policy clip parameter: 0.2; Value loss clip parameter: 0.5; Gradient clip parameter: 0.5; Optimizer: Adam; Optimizer step size: [1e-05, 3e-05, 1e-04, 3e-04, 1e-03]. (These values are restated as a config sketch below.) |
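
The Research Type row quotes the paper's core claim: RTUs, a small modification of LRUs, make RTRL practical for online RL. A minimal sketch of why element-wise (diagonal) recurrences admit cheap RTRL is below. This is an assumption-laden illustration, not the authors' RTU parameterization or code: the names `rtrl_step`, `lam`, and `w_in` are hypothetical, and the actual RTUs add further structure not shown here.

```python
# Minimal sketch (assumption, not the authors' code) of element-wise RTRL for a
# diagonal linear recurrence h_t = lam * h_{t-1} + W x_t. Because the recurrence
# is element-wise, the sensitivity dh_t/dlam is a per-unit trace that can be
# updated online in O(hidden) memory, which is what makes RTRL tractable for
# LRU/RTU-style recurrences.
import jax
import jax.numpy as jnp

def rtrl_step(carry, x_t, lam, w_in):
    h_prev, trace_prev = carry
    h_t = lam * h_prev + w_in @ x_t
    # dh_t/dlam = h_{t-1} + lam * dh_{t-1}/dlam   (element-wise chain rule)
    trace_t = h_prev + lam * trace_prev
    return (h_t, trace_t), h_t

def grad_wrt_lam(dloss_dh, trace):
    # Online gradient: dL/dlam = dL/dh_t * dh_t/dlam, all element-wise.
    return dloss_dh * trace

# Tiny usage example.
key = jax.random.PRNGKey(0)
n_hidden, n_in, T = 4, 3, 5
lam = jnp.full((n_hidden,), 0.9)
w_in = 0.1 * jax.random.normal(key, (n_hidden, n_in))
xs = jnp.ones((T, n_in))

carry = (jnp.zeros(n_hidden), jnp.zeros(n_hidden))
(h_T, trace_T), hs = jax.lax.scan(
    lambda c, x: rtrl_step(c, x, lam, w_in), carry, xs)
```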
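
The Open Datasets and Software Dependencies rows mention the JAX implementation of Mujoco from the Brax library (without version numbers). As a hedged illustration only, loading a Brax environment typically looks like the snippet below; the environment name is a placeholder, and the partially observable Mujoco-P/Mujoco-V wrappers used in the paper are not shown.

```python
# Hedged illustration: creating and stepping a Brax (JAX Mujoco) environment.
# The environment name and the absence of POMDP wrappers are assumptions.
import jax
import jax.numpy as jnp
from brax import envs

env = envs.create(env_name="halfcheetah")
state = env.reset(rng=jax.random.PRNGKey(0))
state = env.step(state, jnp.zeros(env.action_size))
```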
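
The PPO hyperparameters from Table 1, quoted in the Experiment Setup row, can be restated as a configuration dictionary. The dictionary and its key names are an illustrative restatement rather than the authors' config file; only the values come from the paper, and the step size is a sweep from which the best value was selected (as noted in the Dataset Splits row).

```python
# Illustrative restatement of the paper's Table 1 PPO hyperparameters
# (key names are assumptions; values are taken from the paper).
ppo_config = {
    "buffer_size": 2048,
    "num_epochs": 10,
    "num_minibatches": 32,
    "gae_lambda": 0.95,
    "discount_gamma": 0.99,
    "policy_clip": 0.2,
    "value_loss_clip": 0.5,
    "gradient_clip": 0.5,
    "optimizer": "adam",
    # Step sizes were swept and the best-performing value was selected.
    "step_size_sweep": [1e-05, 3e-05, 1e-04, 3e-04, 1e-03],
}
```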