DynaMITE-RL: A Dynamic Model for Improved Temporal Meta-Reinforcement Learning
Authors: Anthony Liang, Guy Tennenholtz, Chih-wei Hsu, Yinlam Chow, Erdem Bıyık, Craig Boutilier
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present experiments that demonstrate, while VariBAD and other meta-RL methods struggle to learn good policies given nonstationary latent contexts, DynaMITE-RL exploits the causal structure of a DLCMDP to more efficiently learn performant policies. We compare our approach to several state-of-the-art meta-RL baselines, showing significantly better evaluation returns. We test DynaMITE-RL on a suite of standard meta-RL benchmark tasks including a didactic gridworld navigation, continuous control, and human-in-the-loop robot assistance as shown in Figure 4. |
| Researcher Affiliation | Collaboration | Anthony Liang, University of Southern California, aliang80@usc.edu; Guy Tennenholtz, Google Research, guytenn@google.com; Chih-Wei Hsu, Google Research, cwhsu@google.com; Yinlam Chow, Google DeepMind, yinlamchow@google.com; Erdem Biyik, University of Southern California, erdem.biyik@usc.edu; Craig Boutilier, Google Research, cboutilier@google.com |
| Pseudocode | Yes | Figure 3: Pseudo-code (online RL training) and model architecture of DynaMITE-RL. Algorithm 1 DynaMITE-RL. Algorithm 2 COLLECT_TRAJECTORY. |
| Open Source Code | No | Not at this point, but we will release the code along with the camera-ready version of the paper. |
| Open Datasets | Yes | We test DynaMITE-RL on a suite of standard meta-RL benchmark tasks including a didactic gridworld navigation, continuous control, and human-in-the-loop robot assistance as shown in Figure 4. Gridworld navigation and MuJoCo [41] locomotion tasks are considered by Zintgraf et al. [47], Dorfman et al. [12], and Choshen and Tamar [10]. We modify these environments to incorporate temporal shifts in the reward function and/or environment dynamics. Reacher is a two-joint robot arm task that is part of OpenAI's MuJoCo tasks [6]. Assistive Itch Scratching is part of the Assistive-Gym benchmark [15]. |
| Dataset Splits | No | The paper does not explicitly specify distinct training/validation/test dataset splits with percentages or sample counts for reproduction. While it discusses training and evaluation, it does not use conventional validation-split terminology. |
| Hardware Specification | Yes | All experiments can be run on a single Nvidia RTX A6000 GPU. |
| Software Dependencies | No | For our study, we use the Brax [17] simulator, a physics engine for large-scale rigid-body simulation written in JAX. We use JAX [2], a machine learning framework... We used Proximal Policy Optimization (PPO) training. No specific version numbers for these software components or other libraries are provided (a minimal version-recording sketch for reproducers appears after this table). |
| Experiment Setup | Yes | In this section, we provide the hyperparameter values used for training each of the baselines and DynaMITE-RL. We also provide a more detailed explanation of the model architecture used for each method. Table 4: Training hyperparameters. Table 5: Hyperparameters for Transformer Encoder. |
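
Because the Software Dependencies row notes that the paper names JAX and Brax but pins no versions, the following is a minimal sketch of how a reproducer might record the versions actually installed in their own environment. The package names (`jax`, `jaxlib`, `brax`) are assumptions based on the libraries cited in the paper, not a published environment specification from the authors.

```python
# Minimal sketch: record the versions of the libraries named in the paper
# (JAX, Brax) so a reproduction run documents its own software environment.
# The authors do not publish pinned versions, so any versions recorded here
# are the reproducer's own installation, not the original experimental setup.
from importlib.metadata import PackageNotFoundError, version


def report_versions(packages=("jax", "jaxlib", "brax")):
    """Print the installed version of each package, or note that it is missing."""
    for pkg in packages:
        try:
            print(f"{pkg}=={version(pkg)}")
        except PackageNotFoundError:
            print(f"{pkg}: not installed")


if __name__ == "__main__":
    report_versions()
```

Logging this output alongside reproduction results makes it possible to attribute any discrepancies in returns to library-version differences, which the paper itself leaves unspecified.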