Off-Dynamics Reinforcement Learning via Domain Adaptation and Reward Augmented Imitation
Authors: Yihong Guo, Yixuan Wang, Yuanyuan Shi, Pan Xu, Anqi Liu
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, our method outperforms the pure modified reward method without imitation learning and also outperforms other baselines in benchmark off-dynamics environments. In this section, we conduct experiments on off-dynamics reinforcement learning settings in four OpenAI Gym environments: HalfCheetah-v2, Ant-v2, Walker2d-v2, and Reacher-v2. We compare our method with seven baselines and demonstrate the superiority of the proposed DARAIL. |
| Researcher Affiliation | Academia | Yihong Guo¹, Yixuan Wang¹, Yuanyuan Shi², Pan Xu³, Anqi Liu¹; ¹Johns Hopkins University, ²University of California San Diego, ³Duke University |
| Pseudocode | Yes | Algorithm 1 Domain Adaptation and Reward Augmented Imitation Learning (DARAIL) |
| Open Source Code | Yes | Code is available at https://github.com/guoyihonggyh/Off-Dynamics-Reinforcement-Learning-via-Domain-Adaptation-and-Reward-Augmented-Imitation. |
| Open Datasets | Yes | We conducted experiments on four Mujoco environments, namely, Half Cheetah, Ant, Walker2d, and Reacher on modified gravity/density configurations and broken action environments. |
| Dataset Splits | No | The paper mentions training and evaluation but does not specify the explicit training/validation/test dataset splits (e.g., percentages or sample counts). |
| Hardware Specification | Yes | We run the experiment on a single GPU (NVIDIA RTX A5000, 24564 MiB) with 8 CPUs (AMD Ryzen Threadripper 3960X 24-Core). |
| Software Dependencies | No | The paper mentions environments like Mujoco but does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions or other libraries). |
| Experiment Setup | Yes | For a fair comparison, we tune the parameters of baselines and our method. The hidden layers of the policy and value network are [256, 256] for Half Cheetah, Ant, and Walker2d and [64, 64] for Reacher. The hidden layer of the two classifiers is [64] for Half Cheetah, Ant, and Walker2d and [32] for Reacher. The batch size is set to be 256. We fairly tune the learning rate from [3e-4, 1e-4, 5e-5, 1e-5]. For those methods that require the importance weight ρ, we tune the update steps of the two classifiers trained to obtain the importance weight from [10, 50, 100]. We also add Gaussian noise ϵ ∼ N(0, 1) to the input of the classifiers for regularization, and the noise scale is selected from [0.1, 0.2, 1.0]. |
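
The Experiment Setup row above pins down concrete architecture and tuning choices. Below is a minimal sketch of how those reported settings could be wired up in PyTorch; the function and variable names (`mlp`, `build_networks`, `classifier_input_with_noise`) and the exact classifier input conventions are illustrative assumptions, not taken from the official DARAIL code.

```python
# Hypothetical sketch of the reported network sizes and tuning grids;
# names and classifier input layouts are assumptions, not the authors' code.
import itertools
import torch
import torch.nn as nn

def mlp(sizes, activation=nn.ReLU):
    """Build a simple MLP with the given layer sizes (no activation on the output)."""
    layers = []
    for i in range(len(sizes) - 1):
        layers.append(nn.Linear(sizes[i], sizes[i + 1]))
        if i < len(sizes) - 2:
            layers.append(activation())
    return nn.Sequential(*layers)

# Reported hidden sizes: [256, 256] for HalfCheetah/Ant/Walker2d, [64, 64] for Reacher;
# classifier hidden layer [64] (or [32] for Reacher).
HIDDEN = {"HalfCheetah-v2": [256, 256], "Ant-v2": [256, 256],
          "Walker2d-v2": [256, 256], "Reacher-v2": [64, 64]}
CLF_HIDDEN = {"HalfCheetah-v2": [64], "Ant-v2": [64],
              "Walker2d-v2": [64], "Reacher-v2": [32]}

# Reported batch size and tuning grids.
BATCH_SIZE = 256
LEARNING_RATES = [3e-4, 1e-4, 5e-5, 1e-5]
CLASSIFIER_UPDATE_STEPS = [10, 50, 100]
NOISE_SCALES = [0.1, 0.2, 1.0]

def build_networks(env_name, obs_dim, act_dim):
    """Policy/value networks plus the two classifiers used to estimate the importance weight."""
    policy = mlp([obs_dim] + HIDDEN[env_name] + [act_dim])
    value = mlp([obs_dim] + HIDDEN[env_name] + [1])
    # Two domain classifiers, assumed here to take (s, a) and (s, a, s') respectively;
    # the paper only states that two classifiers are trained to obtain ρ.
    clf_sa = mlp([obs_dim + act_dim] + CLF_HIDDEN[env_name] + [2])
    clf_sas = mlp([2 * obs_dim + act_dim] + CLF_HIDDEN[env_name] + [2])
    return policy, value, clf_sa, clf_sas

def classifier_input_with_noise(x, noise_scale):
    """Gaussian noise regularization on classifier inputs, per the reported setup."""
    return x + noise_scale * torch.randn_like(x)

# Iterating over the reported tuning grid for one environment.
for lr, clf_steps, scale in itertools.product(LEARNING_RATES,
                                              CLASSIFIER_UPDATE_STEPS,
                                              NOISE_SCALES):
    pass  # train and evaluate one configuration with these settings
```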