Efficient Recurrent Off-Policy RL Requires a Context-Encoder-Specific Learning Rate
Authors: Fan-Ming Luo, Zuolin Tu, Zefang Huang, Yang Yu
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluated RESeL in 18 POMDP tasks, including classic, meta-RL, and credit assignment scenarios, as well as five MDP locomotion tasks. The experiments demonstrate significant improvements in training stability with RESeL. Comparative results show that RESeL achieves notable performance improvements over previous recurrent RL baselines in POMDP tasks, and is competitive with or even surpasses state-of-the-art methods in MDP tasks. Further ablation studies highlight the necessity of applying a distinct learning rate for the context encoder. |
| Researcher Affiliation | Collaboration | ¹ National Key Laboratory for Novel Software Technology, Nanjing University, China; School of Artificial Intelligence, Nanjing University, China; ² Polixir.ai |
| Pseudocode | Yes | Algorithm 1: Training Procedure of RESeL |
| Open Source Code | Yes | Code is available at https://github.com/FanmingL/Recurrent-Offpolicy-RL. |
| Open Datasets | Yes | These classic POMDP tasks are developed using PyBullet locomotion environments. The environments include AntBLT, HopperBLT, WalkerBLT, and HalfCheetahBLT. ... The Dynamics-Randomized Tasks environments are based on the MuJoCo environment [37]... These meta-RL tasks have been utilized in various meta-RL algorithms [42, 19]. ... The Key-to-Door environment |
| Dataset Splits | Yes | Following previous meta-RL work [16], we randomized the gravity in MuJoCo environments [37]. We created 60 dynamics functions with different gravities, using the first 40 for training and the remaining for testing. (A sketch of this split follows the table.) |
| Hardware Specification | Yes | All experiments were conducted on a workstation equipped with an Intel Xeon Gold 5218R CPU, four NVIDIA RTX 4090 GPUs, and 250GB of RAM, running Ubuntu 20.04. |
| Software Dependencies | No | The paper mentions using AdamW as an optimizer but does not provide specific version numbers for software dependencies such as libraries, frameworks (e.g., PyTorch, TensorFlow), or CUDA versions. |
| Experiment Setup | Yes | The hyperparameters used for RESeL are listed in Table 2. We mainly tuned the learning rates and batch size for each task; γ and whether the last reward is given as input are determined according to the characteristics of the tasks. Table 2 (hyperparameters of RESeL): context-encoder learning rate LR_CE = 2×10⁻⁶ (classic MuJoCo and classic meta-RL tasks) / 10⁻⁵ (other tasks); other learning rate LR_other for policy = 6×10⁻⁵ (classic MuJoCo and classic meta-RL tasks) / 3×10⁻⁴ (other tasks); other learning rate LR_other for value = 2×10⁻⁴ (classic MuJoCo and classic meta-RL tasks) / 10⁻³ (other tasks); γ = 0.9999 (Key-to-Door) / 0.99 (other tasks); batch size = 2000 (classic POMDP tasks) / 1000 (other tasks); target entropy = −1 × action dimension (all tasks); learning rate of α = 10⁻⁴ (all tasks); soft-update factor for target value network = 0.995 (all tasks); number of randomly sampled data = 5000 (all tasks). (A minimal optimizer sketch using these values follows the table.) |
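
The paper's central recipe, a much smaller learning rate for the recurrent context encoder than for the rest of the actor-critic, maps directly onto optimizer parameter groups. Below is a minimal sketch of that setup; the GRU context encoder and MLP policy head are hypothetical stand-ins, and only the AdamW optimizer and the two learning rates quoted from Table 2 come from the paper.

```python
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, CTX_DIM = 17, 6, 128  # illustrative sizes, not from the paper

# Hypothetical modules standing in for RESeL's context encoder and policy.
context_encoder = nn.GRU(input_size=OBS_DIM + ACT_DIM, hidden_size=CTX_DIM, batch_first=True)
policy_head = nn.Sequential(
    nn.Linear(OBS_DIM + CTX_DIM, 256), nn.ReLU(), nn.Linear(256, ACT_DIM)
)

# Table 2, classic MuJoCo / classic meta-RL setting:
LR_CE = 2e-6     # context-encoder learning rate
LR_OTHER = 6e-5  # learning rate for the remaining policy parameters

# One AdamW optimizer, two parameter groups with distinct learning rates.
optimizer = torch.optim.AdamW([
    {"params": context_encoder.parameters(), "lr": LR_CE},
    {"params": policy_head.parameters(), "lr": LR_OTHER},
])
```

Keeping both groups in a single optimizer means one `optimizer.step()` updates everything, while the encoder still moves orders of magnitude more slowly, which is the property the ablation study singles out as necessary.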
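The remaining Table 2 entries follow the usual soft actor-critic conventions. A hedged sketch of how the target entropy, the α learning rate, and the soft-update factor would typically be wired, assuming a SAC-style update; the value networks are hypothetical, and the minus sign on the target entropy is assumed from the standard SAC convention of −dim(A):

```python
import torch
import torch.nn as nn

ACT_DIM = 6  # illustrative action dimension

# Learned temperature α with its own optimizer (Table 2: LR of α = 1e-4).
log_alpha = torch.zeros(1, requires_grad=True)
alpha_optimizer = torch.optim.AdamW([log_alpha], lr=1e-4)
target_entropy = -float(ACT_DIM)  # Table 2: -1 × action dimension
# Typical SAC usage: alpha_loss = -(log_alpha * (log_prob + target_entropy).detach()).mean()

# Hypothetical value networks for demonstrating the soft update.
value_net = nn.Linear(8, 1)
target_value_net = nn.Linear(8, 1)
target_value_net.load_state_dict(value_net.state_dict())

TAU = 0.995  # Table 2: soft-update factor for the target value network

@torch.no_grad()
def soft_update(online: nn.Module, target: nn.Module, tau: float = TAU) -> None:
    """Polyak-average the online parameters into the target network."""
    for p, tp in zip(online.parameters(), target.parameters()):
        tp.mul_(tau).add_((1.0 - tau) * p)

soft_update(value_net, target_value_net)
```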
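The 40/20 dynamics split quoted under Dataset Splits is likewise simple to reproduce in spirit. A hypothetical sketch; the gravity scaling range and the seed are assumptions, since the paper only states that gravity is randomized across 60 dynamics functions:

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed, assumed for reproducibility
scales = rng.uniform(0.5, 1.5, size=60)  # assumed scaling range around standard gravity
gravities = scales * 9.81                # 60 distinct gravity settings (m/s^2)
train_gravities = gravities[:40]         # first 40 for training
test_gravities = gravities[40:]          # remaining 20 for testing
```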