DynaMITE-RL: A Dynamic Model for Improved Temporal Meta-Reinforcement Learning

Authors: Anthony Liang, Guy Tennenholtz, Chih-wei Hsu, Yinlam Chow, Erdem Bıyık, Craig Boutilier

NeurIPS 2024

Reproducibility variables, assessed results, and supporting LLM responses:
Research Type: Experimental. We present experiments that demonstrate that, while VariBAD and other meta-RL methods struggle to learn good policies given nonstationary latent contexts, DynaMITE-RL exploits the causal structure of a DLCMDP to learn performant policies more efficiently. We compare our approach to several state-of-the-art meta-RL baselines, showing significantly better evaluation returns. We test DynaMITE-RL on a suite of standard meta-RL benchmark tasks, including didactic gridworld navigation, continuous control, and human-in-the-loop robot assistance, as shown in Figure 4.
Researcher Affiliation: Collaboration. Anthony Liang, University of Southern California (aliang80@usc.edu); Guy Tennenholtz, Google Research (guytenn@google.com); Chih-Wei Hsu, Google Research (cwhsu@google.com); Yinlam Chow, Google DeepMind (yinlamchow@google.com); Erdem Biyik, University of Southern California (erdem.biyik@usc.edu); Craig Boutilier, Google Research (cboutilier@google.com).
Pseudocode: Yes. Figure 3 gives pseudo-code for online RL training and the model architecture of DynaMITE-RL. Algorithm 1: DynaMITE-RL. Algorithm 2: COLLECT_TRAJECTORY.
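The paper's Algorithm 1 and Algorithm 2 are not reproduced in this report. As an illustration only, the following is a minimal, self-contained sketch of the general shape such pseudocode describes: an online training loop that repeatedly calls a trajectory-collection subroutine and then updates the model. The environment, policy, and update function here are hypothetical toy stand-ins, not the authors' implementation.

```python
# Illustrative sketch only (not the authors' code): a generic online RL
# training loop with a trajectory-collection subroutine, mirroring the
# Algorithm 1 / Algorithm 2 structure referenced in Figure 3.
import random

class DummyEnv:
    """Toy one-state environment whose latent reward sign can flip between episodes."""
    def __init__(self):
        self.latent = 1.0
    def reset(self):
        if random.random() < 0.5:      # nonstationary latent context
            self.latent *= -1.0
        return 0.0
    def step(self, action):
        reward = self.latent * action
        return 0.0, reward, True, {}

def collect_trajectory(env, policy, max_steps=10):
    """Roll out one episode and return its transitions (Algorithm 2 analogue)."""
    trajectory, obs = [], env.reset()
    for _ in range(max_steps):
        action = policy(obs)
        next_obs, reward, done, _ = env.step(action)
        trajectory.append((obs, action, reward, next_obs, done))
        obs = next_obs
        if done:
            break
    return trajectory

def train(env, policy, update_fn, num_iterations=100, rollouts_per_iter=8):
    """Online training loop (Algorithm 1 analogue): collect rollouts, then update."""
    for _ in range(num_iterations):
        batch = [collect_trajectory(env, policy) for _ in range(rollouts_per_iter)]
        update_fn(batch)   # e.g. latent-model and PPO policy updates in the real method

if __name__ == "__main__":
    train(DummyEnv(),
          policy=lambda obs: random.choice([-1.0, 1.0]),
          update_fn=lambda batch: None)
```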
Open Source Code: No. Not at this point, but we will release the code along with the camera-ready version of the paper.
Open Datasets: Yes. We test DynaMITE-RL on a suite of standard meta-RL benchmark tasks, including didactic gridworld navigation, continuous control, and human-in-the-loop robot assistance, as shown in Figure 4. Gridworld navigation and MuJoCo [41] locomotion tasks are considered by Zintgraf et al. [47], Dorfman et al. [12], and Choshen and Tamar [10]. We modify these environments to incorporate temporal shifts in the reward function and/or environment dynamics (a hedged wrapper sketch follows below). Reacher is a two-joint robot-arm task from OpenAI's MuJoCo tasks [6]. Assistive Itch Scratching is part of the Assistive-Gym benchmark [15].
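To make the "temporal shifts in the reward function" modification concrete, here is a minimal sketch of one way to wrap a standard benchmark environment so that its reward changes at random session boundaries. The environment id, the sign-flipping scheme, and the switch probability are assumptions for illustration, not the paper's actual modification.

```python
# Illustrative sketch (assumption, not the paper's code): a Gymnasium wrapper
# that flips the reward sign at random points in time, giving a nonstationary
# latent reward context on top of a standard task such as Reacher.
import gymnasium as gym
import numpy as np

class TemporalRewardShiftWrapper(gym.Wrapper):
    """Flips the sign of the reward at random 'session' boundaries."""
    def __init__(self, env, switch_prob=0.05, seed=0):
        super().__init__(env)
        self.rng = np.random.default_rng(seed)
        self.sign = 1.0
        self.switch_prob = switch_prob

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        if self.rng.random() < self.switch_prob:   # latent context switch
            self.sign *= -1.0
        return obs, self.sign * reward, terminated, truncated, info

# Requires gymnasium with MuJoCo support; the environment id may differ by version.
env = TemporalRewardShiftWrapper(gym.make("Reacher-v4"))
obs, info = env.reset(seed=0)
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
```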
Dataset Splits: No. The paper does not explicitly specify distinct training/validation/test dataset splits with percentages or sample counts for reproduction. While it discusses training and evaluation, the usual validation-split terminology is not present.
Hardware Specification: Yes. All experiments can be run on a single NVIDIA RTX A6000 GPU.
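Since the experiments target a single GPU, a reproduction can quickly confirm that the accelerator is visible to the framework before launching long runs. This is a generic sanity check, assuming a JAX installation with GPU support; it is not part of the paper.

```python
# Sanity check (assumption: JAX installed with CUDA support) that a GPU such as
# an RTX A6000 is visible before starting experiments.
import jax

print(jax.devices())  # e.g. a single CUDA device entry
assert any(d.platform == "gpu" for d in jax.devices()), "No GPU visible to JAX"
```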
Software Dependencies: No. For our study, we use the Brax [17] simulator, a physics engine for large-scale rigid-body simulation written in JAX. We use JAX [2], a machine learning framework... We use Proximal Policy Optimization (PPO) for training. No specific version numbers for these software components or other libraries are provided.
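Because no versions are pinned, a reproduction should at minimum record the versions it actually uses. The snippet below is a small sketch for doing so; the package list assumes the stack the paper mentions (JAX, Brax) plus common companion libraries (jaxlib, flax, optax), which are assumptions rather than confirmed dependencies.

```python
# Record the Python and package versions used in a reproduction run, since the
# paper itself does not pin any. Package names beyond jax/brax are assumptions.
import importlib.metadata as md
import sys

print("python", sys.version.split()[0])
for pkg in ("jax", "jaxlib", "brax", "flax", "optax"):
    try:
        print(pkg, md.version(pkg))
    except md.PackageNotFoundError:
        print(pkg, "not installed")
```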
Experiment Setup: Yes. In this section, we provide the hyperparameter values used for training each of the baselines and DynaMITE-RL. We also provide a more detailed explanation of the model architecture used for each method. Table 4: Training hyperparameters. Table 5: Hyperparameters for the Transformer encoder.
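For a reproduction, the hyperparameters from Tables 4 and 5 would typically be collected into an explicit configuration object. The sketch below shows only the structure of such a configuration; every value is a placeholder to be replaced with the numbers reported in the paper's tables, and the field names are assumptions.

```python
# Structural sketch only: a configuration to be filled in from the paper's
# Table 4 (training hyperparameters) and Table 5 (Transformer encoder
# hyperparameters). All values below are placeholders, not taken from the paper.
from dataclasses import dataclass

@dataclass
class TrainingConfig:
    learning_rate: float = 3e-4      # placeholder; see Table 4
    ppo_clip_epsilon: float = 0.2    # placeholder; see Table 4
    discount_gamma: float = 0.99     # placeholder; see Table 4
    num_env_steps: int = 1_000_000   # placeholder; see Table 4

@dataclass
class EncoderConfig:
    num_layers: int = 2              # placeholder; see Table 5
    num_heads: int = 4               # placeholder; see Table 5
    embedding_dim: int = 64          # placeholder; see Table 5

training_config = TrainingConfig()
encoder_config = EncoderConfig()
```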