Real-Time Recurrent Learning using Trace Units in Reinforcement Learning

Authors: Esraa Elelimy, Adam White, Michael Bowling, Martha White

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we build on these insights to provide a lightweight but effective approach for training RNNs in online RL. We introduce Recurrent Trace Units (RTUs), a small modification on LRUs that we nonetheless find to have significant performance benefits over LRUs when trained with RTRL. We find RTUs significantly outperform other recurrent architectures across several partially observable environments while using significantly less computation.
Researcher Affiliation | Academia | Esraa Elelimy, Adam White, Michael Bowling, Martha White; University of Alberta, Alberta Machine Intelligence Institute (Amii); Canada CIFAR AI Chair; {elelimy, amw8, mbowling, whitem}@ualberta.ca
Pseudocode | Yes | In Appendix G, Algorithm 1, we provide the pseudocode for integrating RTRL methods with PPO, with optional steps for re-running the network to update the RTRL gradient traces, the value targets, and the advantage estimates. (A minimal illustrative RTRL sketch appears after the table.)
Open Source Code | Yes | Code available at https://github.com/esraaelelimy/rtus
Open Datasets | Yes | We use a simple multi-step prediction task called Trace conditioning [39] inspired by experiments in animal learning. We use the standard Mujoco POMDP benchmark widely used in prior work for evaluating memory-based RL agents [31, 12, 27, 32]. We use several tasks from the POPGym benchmark [29].
Dataset Splits | No | No. The paper mentions tuning hyperparameters by running agents with different step sizes and selecting the 'best step size value', but does not describe this as using a dedicated 'validation set' or 'validation split' of a dataset.
Hardware Specification | Yes | We ran the Mujoco-P, Mujoco-V on NVIDIA P100 GPU.
Software Dependencies | No | No. The paper mentions software like 'Adam optimizer', 'PPO [40]', 'Jax implementation of Mujoco from the Brax library [7]', but it does not provide specific version numbers for any of these software components.
Experiment Setup | Yes | Table 1: Hyperparameters for PPO. Buffer size: 2048; Num epochs: 10; Number of mini-batches: 32; GAE λ: 0.95; Discount factor γ: 0.99; Policy clip parameter: 0.2; Value loss clip parameter: 0.5; Gradient clip parameter: 0.5; Optimizer: Adam; Optimizer step size: [1e-05, 3e-05, 1e-04, 3e-04, 1e-03].
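
For reference, the Table 1 values in the Experiment Setup row can be restated as a plain configuration. This is only an illustrative sketch: the key names are assumptions, and only the values come from the paper's Table 1.

```python
# Illustrative restatement of Table 1 (PPO hyperparameters); key names are assumed,
# values are as reported in the paper.
ppo_config = {
    "buffer_size": 2048,
    "num_epochs": 10,
    "num_minibatches": 32,
    "gae_lambda": 0.95,
    "discount_gamma": 0.99,
    "policy_clip": 0.2,
    "value_loss_clip": 0.5,
    "gradient_clip": 0.5,
    "optimizer": "Adam",
    # The step size was swept over these values and the best one selected per task.
    "step_size_sweep": [1e-05, 3e-05, 1e-04, 3e-04, 1e-03],
}
```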
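
Regarding the Pseudocode row: the paper's Algorithm 1 integrates RTRL-style trace updates into PPO. As intuition for why RTRL becomes cheap for RTU/LRU-style units, the sketch below shows online RTRL sensitivity traces for a diagonal linear recurrence. It is a minimal illustration under simplifying assumptions, not the paper's RTU parameterization or its PPO integration; all variable names and the toy objective are invented for the example.

```python
import numpy as np

# Minimal illustrative sketch (NOT the paper's RTU or Algorithm 1): online RTRL for a
# diagonal linear recurrence h_t = lam * h_{t-1} + W @ x_t. Because the recurrence is
# elementwise, the sensitivity traces stay small (one scalar per unit for lam, one row
# per unit for W), and no backpropagation through time is needed.
rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
lam = rng.uniform(0.5, 0.99, size=n_hid)       # per-unit recurrent decay (learned)
W = rng.normal(scale=0.1, size=(n_hid, n_in))  # input weights

h = np.zeros(n_hid)
trace_lam = np.zeros(n_hid)        # d h_{t,i} / d lam_i
trace_W = np.zeros((n_hid, n_in))  # d h_{t,i} / d W_{ij}

lr = 1e-2
for t in range(200):
    x = rng.normal(size=n_in)
    target = np.sin(0.1 * t) * np.ones(n_hid)  # toy per-step regression target

    # RTRL trace updates use h_{t-1}, so update them before overwriting h.
    trace_lam = h + lam * trace_lam
    trace_W = x[None, :] + lam[:, None] * trace_W

    # Forward step of the recurrence.
    h = lam * h + W @ x

    # Online gradient step on the instantaneous loss 0.5 * ||h - target||^2,
    # using the traces in place of backpropagation through time.
    err = h - target
    lam -= lr * err * trace_lam
    W -= lr * err[:, None] * trace_W
```

For a fully recurrent network, RTRL's traces scale roughly cubically in the hidden size; here each trace entry couples to a single unit, so the cost stays linear in the number of parameters, which is what makes running RTRL online inside an RL loop affordable.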