Off-Dynamics Reinforcement Learning via Domain Adaptation and Reward Augmented Imitation
Authors: Yihong Guo, Yixuan Wang, Yuanyuan Shi, Pan Xu, Anqi Liu
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, our method outperforms the pure modified reward method without imitation learning and also outperforms other baselines in benchmark off-dynamics environments. In this section, we conduct experiments on off-dynamics reinforcement learning settings in four OpenAI Gym environments: HalfCheetah-v2, Ant-v2, Walker2d-v2, and Reacher-v2. We compare our method with seven baselines and demonstrate the superiority of the proposed DARAIL. |
| Researcher Affiliation | Academia | Yihong Guo¹, Yixuan Wang¹, Yuanyuan Shi², Pan Xu³, Anqi Liu¹; ¹Johns Hopkins University, ²University of California San Diego, ³Duke University |
| Pseudocode | Yes | Algorithm 1 Domain Adaptation and Reward Augmented Imitation Learning (DARAIL) |
| Open Source Code | Yes | Code is available at https://github.com/guoyihonggyh/Off-Dynamics-Reinforcement-Learning-via-Domain-Adaptation-and-Reward-Augmented-Imitation. |
| Open Datasets | Yes | We conducted experiments on four Mujoco environments, namely, Half Cheetah, Ant, Walker2d, and Reacher on modified gravity/density configurations and broken action environments. |
| Dataset Splits | No | The paper mentions training and evaluation but does not specify the explicit training/validation/test dataset splits (e.g., percentages or sample counts). |
| Hardware Specification | Yes | We run the experiment on a single GPU (NVIDIA RTX A5000, 24564 MiB) with 8 CPUs (AMD Ryzen Threadripper 3960X 24-Core). |
| Software Dependencies | No | The paper mentions environments like Mujoco but does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions or other libraries). |
| Experiment Setup | Yes | For a fair comparison, we tune the parameters of baselines and our method. The hidden layers of the policy and value network are [256, 256] for Half Cheetah, Ant, and Walker2d and [64, 64] for Reacher. The hidden layer of the two classifiers is [64] for Half Cheetah, Ant, and Walker2d and [32] for Reacher. The batch size is set to be 256. We fairly tune the learning rate from [3e-4, 1e-4, 5e-5, 1e-5]. For those methods that require the importance weight ρ, we tune the update steps of the two classifiers trained to obtain the importance weight from [10, 50, 100]. We also add Gaussian noise ϵ ∼ N(0, 1) to the input of the classifiers for regularization, and the noise scale is selected from [0.1, 0.2, 1.0]. |
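
The Experiment Setup row above pins down concrete architecture and tuning choices. Below is a minimal sketch of how those reported settings could be wired up in PyTorch; the function and variable names (`mlp`, `build_networks`, `classifier_input_with_noise`) and the exact classifier input conventions are illustrative assumptions, not taken from the official DARAIL code.

```python
# Hypothetical sketch of the reported network sizes and tuning grids;
# names and classifier input layouts are assumptions, not the authors' code.
import itertools
import torch
import torch.nn as nn

def mlp(sizes, activation=nn.ReLU):
    """Build a simple MLP with the given layer sizes (no activation on the output)."""
    layers = []
    for i in range(len(sizes) - 1):
        layers.append(nn.Linear(sizes[i], sizes[i + 1]))
        if i < len(sizes) - 2:
            layers.append(activation())
    return nn.Sequential(*layers)

# Reported hidden sizes: [256, 256] for HalfCheetah/Ant/Walker2d, [64, 64] for Reacher;
# classifier hidden layer [64] (or [32] for Reacher).
HIDDEN = {"HalfCheetah-v2": [256, 256], "Ant-v2": [256, 256],
          "Walker2d-v2": [256, 256], "Reacher-v2": [64, 64]}
CLF_HIDDEN = {"HalfCheetah-v2": [64], "Ant-v2": [64],
              "Walker2d-v2": [64], "Reacher-v2": [32]}

# Reported batch size and tuning grids.
BATCH_SIZE = 256
LEARNING_RATES = [3e-4, 1e-4, 5e-5, 1e-5]
CLASSIFIER_UPDATE_STEPS = [10, 50, 100]
NOISE_SCALES = [0.1, 0.2, 1.0]

def build_networks(env_name, obs_dim, act_dim):
    """Policy/value networks plus the two classifiers used to estimate the importance weight."""
    policy = mlp([obs_dim] + HIDDEN[env_name] + [act_dim])
    value = mlp([obs_dim] + HIDDEN[env_name] + [1])
    # Two domain classifiers, assumed here to take (s, a) and (s, a, s') respectively;
    # the paper only states that two classifiers are trained to obtain ρ.
    clf_sa = mlp([obs_dim + act_dim] + CLF_HIDDEN[env_name] + [2])
    clf_sas = mlp([2 * obs_dim + act_dim] + CLF_HIDDEN[env_name] + [2])
    return policy, value, clf_sa, clf_sas

def classifier_input_with_noise(x, noise_scale):
    """Gaussian noise regularization on classifier inputs, per the reported setup."""
    return x + noise_scale * torch.randn_like(x)

# Iterating over the reported tuning grid for one environment.
for lr, clf_steps, scale in itertools.product(LEARNING_RATES,
                                              CLASSIFIER_UPDATE_STEPS,
                                              NOISE_SCALES):
    pass  # train and evaluate one configuration with these settings
```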