DynaMITE-RL: A Dynamic Model for Improved Temporal Meta-Reinforcement Learning

Authors: Anthony Liang, Guy Tennenholtz, Chih-wei Hsu, Yinlam Chow, Erdem Bıyık, Craig Boutilier

NeurIPS 2024

Reproducibility variables, assessed results, and supporting LLM responses:
Research Type: Experimental. We present experiments that demonstrate that, while VariBAD and other meta-RL methods struggle to learn good policies given nonstationary latent contexts, DynaMITE-RL exploits the causal structure of a DLCMDP to learn performant policies more efficiently. We compare our approach to several state-of-the-art meta-RL baselines, showing significantly better evaluation returns. We test DynaMITE-RL on a suite of standard meta-RL benchmark tasks, including didactic gridworld navigation, continuous control, and human-in-the-loop robot assistance, as shown in Figure 4.
Researcher Affiliation: Collaboration. Anthony Liang, University of Southern California (aliang80@usc.edu); Guy Tennenholtz, Google Research (guytenn@google.com); Chih-Wei Hsu, Google Research (cwhsu@google.com); Yinlam Chow, Google DeepMind (yinlamchow@google.com); Erdem Biyik, University of Southern California (erdem.biyik@usc.edu); Craig Boutilier, Google Research (cboutilier@google.com).
Pseudocode: Yes. Figure 3 gives pseudo-code for online RL training and the model architecture of DynaMITE-RL. Algorithm 1: DynaMITE-RL. Algorithm 2: COLLECT_TRAJECTORY.
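The paper's Algorithm 1 and Algorithm 2 are not reproduced in this report. As an illustration only, the following is a minimal, self-contained sketch of the general shape such pseudocode describes: an online training loop that repeatedly calls a trajectory-collection subroutine and then updates the model. The environment, policy, and update function here are hypothetical toy stand-ins, not the authors' implementation.

```python
# Illustrative sketch only (not the authors' code): a generic online RL
# training loop with a trajectory-collection subroutine, mirroring the
# Algorithm 1 / Algorithm 2 structure referenced in Figure 3.
import random

class DummyEnv:
    """Toy one-state environment whose latent reward sign can flip between episodes."""
    def __init__(self):
        self.latent = 1.0
    def reset(self):
        if random.random() < 0.5:      # nonstationary latent context
            self.latent *= -1.0
        return 0.0
    def step(self, action):
        reward = self.latent * action
        return 0.0, reward, True, {}

def collect_trajectory(env, policy, max_steps=10):
    """Roll out one episode and return its transitions (Algorithm 2 analogue)."""
    trajectory, obs = [], env.reset()
    for _ in range(max_steps):
        action = policy(obs)
        next_obs, reward, done, _ = env.step(action)
        trajectory.append((obs, action, reward, next_obs, done))
        obs = next_obs
        if done:
            break
    return trajectory

def train(env, policy, update_fn, num_iterations=100, rollouts_per_iter=8):
    """Online training loop (Algorithm 1 analogue): collect rollouts, then update."""
    for _ in range(num_iterations):
        batch = [collect_trajectory(env, policy) for _ in range(rollouts_per_iter)]
        update_fn(batch)   # e.g. latent-model and PPO policy updates in the real method

if __name__ == "__main__":
    train(DummyEnv(),
          policy=lambda obs: random.choice([-1.0, 1.0]),
          update_fn=lambda batch: None)
```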
Open Source Code: No. Not at this point, but we will release the code along with the camera-ready version of the paper.
Open Datasets: Yes. We test DynaMITE-RL on a suite of standard meta-RL benchmark tasks, including didactic gridworld navigation, continuous control, and human-in-the-loop robot assistance, as shown in Figure 4. Gridworld navigation and MuJoCo [41] locomotion tasks are considered by Zintgraf et al. [47], Dorfman et al. [12], and Choshen and Tamar [10]. We modify these environments to incorporate temporal shifts in the reward function and/or environment dynamics (a hedged wrapper sketch follows below). Reacher is a two-joint robot-arm task from OpenAI's MuJoCo tasks [6]. Assistive Itch Scratching is part of the Assistive-Gym benchmark [15].
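To make the "temporal shifts in the reward function" modification concrete, here is a minimal sketch of one way to wrap a standard benchmark environment so that its reward changes at random session boundaries. The environment id, the sign-flipping scheme, and the switch probability are assumptions for illustration, not the paper's actual modification.

```python
# Illustrative sketch (assumption, not the paper's code): a Gymnasium wrapper
# that flips the reward sign at random points in time, giving a nonstationary
# latent reward context on top of a standard task such as Reacher.
import gymnasium as gym
import numpy as np

class TemporalRewardShiftWrapper(gym.Wrapper):
    """Flips the sign of the reward at random 'session' boundaries."""
    def __init__(self, env, switch_prob=0.05, seed=0):
        super().__init__(env)
        self.rng = np.random.default_rng(seed)
        self.sign = 1.0
        self.switch_prob = switch_prob

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        if self.rng.random() < self.switch_prob:   # latent context switch
            self.sign *= -1.0
        return obs, self.sign * reward, terminated, truncated, info

# Requires gymnasium with MuJoCo support; the environment id may differ by version.
env = TemporalRewardShiftWrapper(gym.make("Reacher-v4"))
obs, info = env.reset(seed=0)
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
```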
Dataset Splits: No. The paper does not explicitly specify distinct training/validation/test dataset splits with percentages or sample counts for reproduction. While it discusses training and evaluation, the usual validation-split terminology is not present.
Hardware Specification: Yes. All experiments can be run on a single NVIDIA RTX A6000 GPU.
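Since the experiments target a single GPU, a reproduction can quickly confirm that the accelerator is visible to the framework before launching long runs. This is a generic sanity check, assuming a JAX installation with GPU support; it is not part of the paper.

```python
# Sanity check (assumption: JAX installed with CUDA support) that a GPU such as
# an RTX A6000 is visible before starting experiments.
import jax

print(jax.devices())  # e.g. a single CUDA device entry
assert any(d.platform == "gpu" for d in jax.devices()), "No GPU visible to JAX"
```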
Software Dependencies: No. For our study, we use the Brax [17] simulator, a physics engine for large-scale rigid-body simulation written in JAX. We use JAX [2], a machine learning framework... We use Proximal Policy Optimization (PPO) for training. No specific version numbers for these software components or other libraries are provided.
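Because no versions are pinned, a reproduction should at minimum record the versions it actually uses. The snippet below is a small sketch for doing so; the package list assumes the stack the paper mentions (JAX, Brax) plus common companion libraries (jaxlib, flax, optax), which are assumptions rather than confirmed dependencies.

```python
# Record the Python and package versions used in a reproduction run, since the
# paper itself does not pin any. Package names beyond jax/brax are assumptions.
import importlib.metadata as md
import sys

print("python", sys.version.split()[0])
for pkg in ("jax", "jaxlib", "brax", "flax", "optax"):
    try:
        print(pkg, md.version(pkg))
    except md.PackageNotFoundError:
        print(pkg, "not installed")
```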
Experiment Setup: Yes. In this section, we provide the hyperparameter values used for training each of the baselines and DynaMITE-RL. We also provide a more detailed explanation of the model architecture used for each method. Table 4: Training hyperparameters. Table 5: Hyperparameters for the Transformer encoder.
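For a reproduction, the hyperparameters from Tables 4 and 5 would typically be collected into an explicit configuration object. The sketch below shows only the structure of such a configuration; every value is a placeholder to be replaced with the numbers reported in the paper's tables, and the field names are assumptions.

```python
# Structural sketch only: a configuration to be filled in from the paper's
# Table 4 (training hyperparameters) and Table 5 (Transformer encoder
# hyperparameters). All values below are placeholders, not taken from the paper.
from dataclasses import dataclass

@dataclass
class TrainingConfig:
    learning_rate: float = 3e-4      # placeholder; see Table 4
    ppo_clip_epsilon: float = 0.2    # placeholder; see Table 4
    discount_gamma: float = 0.99     # placeholder; see Table 4
    num_env_steps: int = 1_000_000   # placeholder; see Table 4

@dataclass
class EncoderConfig:
    num_layers: int = 2              # placeholder; see Table 5
    num_heads: int = 4               # placeholder; see Table 5
    embedding_dim: int = 64          # placeholder; see Table 5

training_config = TrainingConfig()
encoder_config = EncoderConfig()
```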