Efficient Recurrent Off-Policy RL Requires a Context-Encoder-Specific Learning Rate

Authors: Fan-Ming Luo, Zuolin Tu, Zefang Huang, Yang Yu

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluated RESeL in 18 POMDP tasks, including classic, meta-RL, and credit-assignment scenarios, as well as five MDP locomotion tasks. The experiments demonstrate significant improvements in training stability with RESeL. Comparative results show that RESeL achieves notable performance improvements over previous recurrent RL baselines in POMDP tasks, and is competitive with or even surpasses state-of-the-art methods in MDP tasks. Further ablation studies highlight the necessity of applying a distinct learning rate for the context encoder.
Researcher Affiliation | Collaboration | (1) National Key Laboratory for Novel Software Technology, Nanjing University, China; School of Artificial Intelligence, Nanjing University, China. (2) Polixir.ai
Pseudocode | Yes | Algorithm 1: Training Procedure of RESeL
Open Source Code | Yes | Code is available at https://github.com/FanmingL/Recurrent-Offpolicy-RL.
Open Datasets | Yes | These classic POMDP tasks are developed using PyBullet locomotion environments. The environments include AntBLT, HopperBLT, WalkerBLT, and HalfCheetahBLT. ... The Dynamics-Randomized Task environments are based on the MuJoCo environment [37]... These meta-RL tasks have been utilized in various meta-RL algorithms [42, 19]. ... The Key-to-Door environment
Dataset Splits | Yes | Following previous meta-RL work [16], we randomized the gravity in MuJoCo environments [37]. We created 60 dynamics functions with different gravities, using the first 40 for training and the remaining for testing. (A code sketch of this split appears after the table.)
Hardware Specification | Yes | All experiments were conducted on a workstation equipped with an Intel Xeon Gold 5218R CPU, four NVIDIA RTX 4090 GPUs, and 250 GB of RAM, running Ubuntu 20.04.
Software Dependencies | No | The paper mentions using AdamW as an optimizer but does not provide specific version numbers for software dependencies such as libraries, frameworks (e.g., PyTorch, TensorFlow), or CUDA versions.
Experiment Setup | Yes | The hyperparameters used for RESeL are listed in Table 2. We mainly tuned the learning rates and batch size for each task; γ and whether the last reward is used as input are determined by the characteristics of the task. Table 2 (Hyperparameters of RESeL): context encoder learning rate LR_CE = 2×10⁻⁶ (classic MuJoCo and classic meta-RL tasks), 10⁻⁵ (other tasks); other learning rate LR_other for policy = 6×10⁻⁵ (classic MuJoCo and classic meta-RL tasks), 3×10⁻⁴ (other tasks); other learning rate LR_other for value = 2×10⁻⁴ (classic MuJoCo and classic meta-RL tasks), 10⁻³ (other tasks); γ = 0.9999 (Key-to-Door), 0.99 (other tasks); batch size = 2000 (classic POMDP tasks), 1000 (other tasks); target entropy = −1 × dimension of action (all tasks); learning rate of α = 10⁻⁴ (all tasks); soft-update factor for target value network = 0.995 (all tasks); number of randomly sampled data = 5000 (all tasks). (A hedged sketch of this optimizer setup follows below.)
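
To make the Experiment Setup row concrete, here is a minimal sketch, not the authors' implementation, of how a context-encoder-specific learning rate can be wired up with AdamW parameter groups in PyTorch, using the classic MuJoCo / classic meta-RL values quoted from Table 2. The module names and shapes (context_encoder, policy_head, value_head) and the choice of which losses update the encoder are illustrative assumptions, not details taken from the paper or its repository.

```python
# Minimal sketch (not the authors' code): AdamW optimizers with a
# context-encoder-specific learning rate, using the classic MuJoCo /
# classic meta-RL values quoted from Table 2. Module shapes and names
# are illustrative placeholders.
import torch
import torch.nn as nn

context_encoder = nn.GRU(input_size=32, hidden_size=128, batch_first=True)
policy_head = nn.Linear(128, 8)       # stand-in for the policy network
value_head = nn.Linear(128 + 8, 1)    # stand-in for the value network

LR_CE = 2e-6            # context encoder learning rate (Table 2)
LR_OTHER_POLICY = 6e-5  # "other" learning rate for the policy (Table 2)
LR_OTHER_VALUE = 2e-4   # "other" learning rate for the value (Table 2)

# Parameter groups let a single optimizer.step() apply the small encoder
# rate and the larger head rate simultaneously. Whether the encoder is
# updated by the policy loss, the value loss, or both is an assumption
# made here for illustration.
policy_optimizer = torch.optim.AdamW([
    {"params": context_encoder.parameters(), "lr": LR_CE},
    {"params": policy_head.parameters(), "lr": LR_OTHER_POLICY},
])
value_optimizer = torch.optim.AdamW([
    {"params": context_encoder.parameters(), "lr": LR_CE},
    {"params": value_head.parameters(), "lr": LR_OTHER_VALUE},
])
```

The point of the grouping is that the recurrent encoder learns at a rate orders of magnitude smaller than the rest of the agent, which is the distinct-learning-rate mechanism the paper's ablation studies identify as necessary for stable training.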
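
The Dataset Splits row above describes 60 gravity-randomized dynamics functions with the first 40 used for training and the remaining 20 for testing. The following is a minimal sketch of such a split, assuming Gymnasium with the MuJoCo bindings; the sampling range, seed, environment name, and the make_env_with_gravity helper are all assumptions for illustration, not taken from the paper.

```python
# Minimal sketch (assumed details, not the paper's code) of the
# gravity-randomized train/test split: 60 dynamics functions,
# first 40 for training, remaining 20 for testing.
import gymnasium as gym
import numpy as np

rng = np.random.default_rng(seed=0)              # seed assumed
gravity_scales = rng.uniform(0.5, 1.5, size=60)  # 60 gravity settings (range assumed)

train_scales = gravity_scales[:40]  # first 40 dynamics functions for training
test_scales = gravity_scales[40:]   # remaining 20 for testing

def make_env_with_gravity(scale: float):
    """Hypothetical factory: a MuJoCo env whose gravity vector is rescaled."""
    env = gym.make("HalfCheetah-v4")             # environment choice assumed
    env.unwrapped.model.opt.gravity[:] *= scale  # MuJoCo exposes gravity as a 3-vector
    return env

train_env_example = make_env_with_gravity(train_scales[0])
```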