Real-Time Recurrent Learning using Trace Units in Reinforcement Learning
Authors: Esraa Elelimy, Adam White, Michael Bowling, Martha White
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we build on these insights to provide a lightweight but effective approach for training RNNs in online RL. We introduce Recurrent Trace Units (RTUs), a small modification on LRUs that we nonetheless find to have significant performance benefits over LRUs when trained with RTRL. We find RTUs significantly outperform other recurrent architectures across several partially observable environments while using significantly less computation. (A hedged sketch of the element-wise RTRL trace idea follows the table.) |
| Researcher Affiliation | Academia | Esraa Elelimy, Adam White, Michael Bowling, Martha White. University of Alberta, Alberta Machine Intelligence Institute (Amii); Canada CIFAR AI Chair. {elelimy, amw8, mbowling, whitem}@ualberta.ca |
| Pseudocode | Yes | In Appendix G, Algorithm 1, we provide the pseudocode for integrating RTRL methods with PPO, with optional steps for re-running the network to update the RTRL gradient traces, the value targets, and the advantage estimates. |
| Open Source Code | Yes | Code available at https://github.com/esraaelelimy/rtus |
| Open Datasets | Yes | We use a simple multi-step prediction task called Trace conditioning [39], inspired by experiments in animal learning. We use the standard Mujoco POMDP benchmark widely used in prior work for evaluating memory-based RL agents [31, 12, 27, 32]. We use several tasks from the POPGym benchmark [29]. (A hedged Brax environment-loading sketch follows the table.) |
| Dataset Splits | No | No. The paper mentions tuning hyperparameters by running agents with different step sizes and selecting the 'best step size value' but does not describe this as using a dedicated 'validation set' or 'validation split' of a dataset. |
| Hardware Specification | Yes | We ran Mujoco-P and Mujoco-V on an NVIDIA P100 GPU. |
| Software Dependencies | No | No. The paper mentions software like 'Adam optimizer', 'PPO [40]', 'Jax implementation of Mujoco from the Brax library [7]', but it does not provide specific version numbers for any of these software components. |
| Experiment Setup | Yes | Table 1: Hyperparameters for PPO. Buffer size: 2048; Num epochs: 10; Number of mini-batches: 32; GAE λ: 0.95; Discount factor γ: 0.99; Policy clip parameter: 0.2; Value loss clip parameter: 0.5; Gradient clip parameter: 0.5; Optimizer: Adam; Optimizer step size: [1e-05, 3e-05, 1e-04, 3e-04, 1e-03]. (These values are restated as a config sketch below.) |
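
The Research Type row quotes the paper's core claim: RTUs, a small modification of LRUs, make RTRL practical for online RL. A minimal sketch of why element-wise (diagonal) recurrences admit cheap RTRL is below. This is an assumption-laden illustration, not the authors' RTU parameterization or code: the names `rtrl_step`, `lam`, and `w_in` are hypothetical, and the actual RTUs add further structure not shown here.

```python
# Minimal sketch (assumption, not the authors' code) of element-wise RTRL for a
# diagonal linear recurrence h_t = lam * h_{t-1} + W x_t. Because the recurrence
# is element-wise, the sensitivity dh_t/dlam is a per-unit trace that can be
# updated online in O(hidden) memory, which is what makes RTRL tractable for
# LRU/RTU-style recurrences.
import jax
import jax.numpy as jnp

def rtrl_step(carry, x_t, lam, w_in):
    h_prev, trace_prev = carry
    h_t = lam * h_prev + w_in @ x_t
    # dh_t/dlam = h_{t-1} + lam * dh_{t-1}/dlam   (element-wise chain rule)
    trace_t = h_prev + lam * trace_prev
    return (h_t, trace_t), h_t

def grad_wrt_lam(dloss_dh, trace):
    # Online gradient: dL/dlam = dL/dh_t * dh_t/dlam, all element-wise.
    return dloss_dh * trace

# Tiny usage example.
key = jax.random.PRNGKey(0)
n_hidden, n_in, T = 4, 3, 5
lam = jnp.full((n_hidden,), 0.9)
w_in = 0.1 * jax.random.normal(key, (n_hidden, n_in))
xs = jnp.ones((T, n_in))

carry = (jnp.zeros(n_hidden), jnp.zeros(n_hidden))
(h_T, trace_T), hs = jax.lax.scan(
    lambda c, x: rtrl_step(c, x, lam, w_in), carry, xs)
```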
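
The Open Datasets and Software Dependencies rows mention the JAX implementation of Mujoco from the Brax library (without version numbers). As a hedged illustration only, loading a Brax environment typically looks like the snippet below; the environment name is a placeholder, and the partially observable Mujoco-P/Mujoco-V wrappers used in the paper are not shown.

```python
# Hedged illustration: creating and stepping a Brax (JAX Mujoco) environment.
# The environment name and the absence of POMDP wrappers are assumptions.
import jax
import jax.numpy as jnp
from brax import envs

env = envs.create(env_name="halfcheetah")
state = env.reset(rng=jax.random.PRNGKey(0))
state = env.step(state, jnp.zeros(env.action_size))
```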
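
The PPO hyperparameters from Table 1, quoted in the Experiment Setup row, can be restated as a configuration dictionary. The dictionary and its key names are an illustrative restatement rather than the authors' config file; only the values come from the paper, and the step size is a sweep from which the best value was selected (as noted in the Dataset Splits row).

```python
# Illustrative restatement of the paper's Table 1 PPO hyperparameters
# (key names are assumptions; values are taken from the paper).
ppo_config = {
    "buffer_size": 2048,
    "num_epochs": 10,
    "num_minibatches": 32,
    "gae_lambda": 0.95,
    "discount_gamma": 0.99,
    "policy_clip": 0.2,
    "value_loss_clip": 0.5,
    "gradient_clip": 0.5,
    "optimizer": "adam",
    # Step sizes were swept and the best-performing value was selected.
    "step_size_sweep": [1e-05, 3e-05, 1e-04, 3e-04, 1e-03],
}
```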