Off-Dynamics Reinforcement Learning via Domain Adaptation and Reward Augmented Imitation

Authors: Yihong Guo, Yixuan Wang, Yuanyuan Shi, Pan Xu, Anqi Liu

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, our method outperforms the pure modified reward method without imitation learning and also outperforms other baselines in benchmark off-dynamics environments. In this section, we conduct experiments on off-dynamics reinforcement learning settings on four OpenAI environments: HalfCheetah-v2, Ant-v2, Walker2d-v2, and Reacher-v2. We compare our method with seven baselines and demonstrate the superiority of the proposed DARAIL.
Researcher Affiliation | Academia | Yihong Guo (Johns Hopkins University), Yixuan Wang (Johns Hopkins University), Yuanyuan Shi (University of California San Diego), Pan Xu (Duke University), Anqi Liu (Johns Hopkins University)
Pseudocode | Yes | Algorithm 1: Domain Adaptation and Reward Augmented Imitation Learning (DARAIL). (A hedged sketch of the classifier-based reward correction behind this family of methods appears after the table.)
Open Source Code | Yes | Code is available at https://github.com/guoyihonggyh/Off-Dynamics-Reinforcement-Learning-via-Domain-Adaptation-and-Reward-Augmented-Imitation.
Open Datasets | Yes | We conducted experiments on four MuJoCo environments, namely Half Cheetah, Ant, Walker2d, and Reacher, under modified gravity/density configurations and broken-action settings. (See the environment-shift sketch after the table.)
Dataset Splits | No | The paper mentions training and evaluation but does not specify explicit training/validation/test dataset splits (e.g., percentages or sample counts).
Hardware Specification | Yes | We run the experiment on a single GPU (NVIDIA RTX A5000, 24564 MiB) with 8 CPUs (AMD Ryzen Threadripper 3960X, 24-core).
Software Dependencies | No | The paper mentions environments such as MuJoCo but does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, or TensorFlow versions, or other libraries).
Experiment Setup | Yes | For a fair comparison, we tune the parameters of the baselines and our method. The hidden layers of the policy and value networks are [256, 256] for Half Cheetah, Ant, and Walker2d and [64, 64] for Reacher; the hidden layer of the two classifiers is [64] for Half Cheetah, Ant, and Walker2d and [32] for Reacher. The batch size is set to 256. We tune the learning rate over [3e-4, 1e-4, 5e-5, 1e-5]. For methods that require the importance weight ρ, we tune the number of update steps of the two classifiers trained to obtain the importance weight over [10, 50, 100]. We also add Gaussian noise ε ∼ N(0, 1) to the input of the classifiers for regularization, with the noise scale selected from [0.1, 0.2, 1.0]. (The grid is restated in the configuration sketch after the table.)
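
The "pure modified reward method" and the two classifiers mentioned above suggest a DARC-style log-ratio reward correction estimated from a (s, a, s') classifier and a (s, a) classifier. The sketch below is a minimal, hypothetical illustration of that correction, assuming binary classifiers that output the logit of P(target | ·); it is not the authors' Algorithm 1, and `DomainClassifier` and `reward_correction` are placeholder names.

```python
import torch
import torch.nn as nn


class DomainClassifier(nn.Module):
    """Binary domain classifier; forward() returns the logit of P(target | x)."""

    def __init__(self, in_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)


def reward_correction(clf_sas: DomainClassifier,
                      clf_sa: DomainClassifier,
                      s: torch.Tensor, a: torch.Tensor, s_next: torch.Tensor,
                      noise_scale: float = 0.1) -> torch.Tensor:
    """DARC-style correction: delta_r ≈ log p_target(s'|s,a) − log p_source(s'|s,a).

    With sigmoid classifiers, log P(target|x) − log P(source|x) equals the
    pre-sigmoid logit, so the correction is a difference of two logits.
    The Gaussian input noise mirrors the regularization described in the
    experiment-setup row above.
    """
    sas = torch.cat([s, a, s_next], dim=-1)
    sa = torch.cat([s, a], dim=-1)
    if noise_scale > 0:
        sas = sas + noise_scale * torch.randn_like(sas)
        sa = sa + noise_scale * torch.randn_like(sa)
    return clf_sas(sas) - clf_sa(sa)
```

In a modified-reward pipeline this correction is added to the source-domain reward before policy optimization; per the description above, DARAIL additionally performs reward-augmented imitation on top of such a policy.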
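
The gravity and broken-action shifts can be approximated, assuming the mujoco-py era Gym API used by the *-v2 environments, roughly as follows. `make_target_env` and `BrokenActionWrapper` are illustrative names, the scaling factor is arbitrary, and the density variant (which would edit the corresponding model properties) is omitted.

```python
import gym
import numpy as np


class BrokenActionWrapper(gym.ActionWrapper):
    """Simulates a broken actuator by zeroing one action dimension."""

    def __init__(self, env, broken_dim=0):
        super().__init__(env)
        self.broken_dim = broken_dim

    def action(self, action):
        action = np.array(action, copy=True)
        action[self.broken_dim] = 0.0
        return action


def make_target_env(name="HalfCheetah-v2", gravity_scale=1.5, broken_dim=None):
    """Build a shifted-dynamics ('target') version of a source environment."""
    env = gym.make(name)
    # Scale gravity on the underlying MuJoCo model (mujoco-py attribute path).
    env.unwrapped.model.opt.gravity[:] *= gravity_scale
    if broken_dim is not None:
        env = BrokenActionWrapper(env, broken_dim=broken_dim)
    return env
```

For example, `make_target_env("Ant-v2", gravity_scale=2.0, broken_dim=0)` would give a target domain with doubled gravity and one disabled actuator, while the unmodified environment serves as the source domain.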
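
Finally, the tuning grid in the experiment-setup row can be summarized as a small configuration sketch; the dictionary below only restates the reported values, and the key names do not come from the released code.

```python
# Restatement of the reported setup; key names are illustrative only.
POLICY_VALUE_HIDDEN = {
    "HalfCheetah-v2": [256, 256],
    "Ant-v2":         [256, 256],
    "Walker2d-v2":    [256, 256],
    "Reacher-v2":     [64, 64],
}
CLASSIFIER_HIDDEN = {
    "HalfCheetah-v2": [64],
    "Ant-v2":         [64],
    "Walker2d-v2":    [64],
    "Reacher-v2":     [32],
}
SEARCH_SPACE = {
    "batch_size": [256],                        # fixed for all methods
    "learning_rate": [3e-4, 1e-4, 5e-5, 1e-5],  # tuned per method
    "classifier_update_steps": [10, 50, 100],   # only for importance-weight methods
    "classifier_noise_scale": [0.1, 0.2, 1.0],  # scale on N(0, 1) input noise
}
```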