Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Off-Dynamics Reinforcement Learning via Domain Adaptation and Reward Augmented Imitation
Authors: Yihong Guo, Yixuan Wang, Yuanyuan Shi, Pan Xu, Anqi Liu
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, our method outperforms the pure modified reward method without imitation learning and also outperforms other baselines in benchmark off-dynamics environments. In this section, we conduct experiments on off-dynamics reinforcement learning settings on four OpenAI environments: HalfCheetah-v2, Ant-v2, Walker2d-v2, and Reacher-v2. We compare our method with seven baselines and demonstrate the superiority of the proposed DARAIL. |
| Researcher Affiliation | Academia | Yihong Guo1, Yixuan Wang1, Yuanyuan Shi2, Pan Xu3, Anqi Liu1 1Johns Hopkins University 2University of California San Diego 3Duke University |
| Pseudocode | Yes | Algorithm 1 Domain Adaptation and Reward Augmented Imitation Learning (DARAIL) |
| Open Source Code | Yes | Code is available at https://github.com/guoyihonggyh/Off-Dynamics-Reinforcement-Learning-via-Domain-Adaptation-and-Reward-Augmented-Imitation. |
| Open Datasets | Yes | We conducted experiments on four Mujoco environments, namely, HalfCheetah, Ant, Walker2d, and Reacher on modified gravity/density configurations and broken action environments. |
| Dataset Splits | No | The paper mentions training and evaluation but does not specify the explicit training/validation/test dataset splits (e.g., percentages or sample counts). |
| Hardware Specification | Yes | We run the experiment on a single GPU: NVIDIA RTX A5000-24564MiB with 8-CPUs: AMD Ryzen Threadripper 3960X 24-Core. |
| Software Dependencies | No | The paper mentions environments like Mujoco but does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions or other libraries). |
| Experiment Setup | Yes | For a fair comparison, we tune the parameters of baselines and our method. The hidden layers of the policy and value network are [256,256] for the HalfCheetah, Ant, and Walker2d and [64,64] for Reacher. And the hidden layer of the two classifiers is [64] for the HalfCheetah, Ant, and Walker2d and [32] for Reacher. The batch size is set to be 256. We fairly tune the learning rate from [3e-4, 1e-4, 5e-5, 1e-5]. For those methods that require the importance weight ρ, we tune the update steps of the two classifiers trained to obtain the importance weight from [10, 50, 100]. We also add Gaussian noise ϵ ∼ N(0, 1) to the input of the classifiers for regularization, and the noise scale is selected from [0.1, 0.2, 1.0]. |
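
The Experiment Setup row above can be read as a small search space. The sketch below writes it out in plain Python as an illustration only; it is not taken from the authors' repository. The names `HIDDEN_SIZES`, `search_grid`, and `add_input_noise` are hypothetical, and the full cross product of the three tuned values is an assumption: the paper lists the candidate values but the quoted text does not say whether they were searched jointly.

```python
import itertools
import random

# Layer widths quoted in the Experiment Setup row (dictionary layout is hypothetical).
_LARGE = {"policy_value_hidden": [256, 256], "classifier_hidden": [64]}
_SMALL = {"policy_value_hidden": [64, 64], "classifier_hidden": [32]}
HIDDEN_SIZES = {
    "HalfCheetah": _LARGE,
    "Ant": _LARGE,
    "Walker2d": _LARGE,
    "Reacher": _SMALL,
}

BATCH_SIZE = 256
LEARNING_RATES = [3e-4, 1e-4, 5e-5, 1e-5]
CLASSIFIER_UPDATE_STEPS = [10, 50, 100]  # only for methods that use the importance weight
NOISE_SCALES = [0.1, 0.2, 1.0]


def search_grid(env_name):
    """Yield one config dict per combination of the tuned values.

    Assumption: a joint grid over all three lists.
    """
    sizes = HIDDEN_SIZES[env_name]
    for lr, steps, scale in itertools.product(
        LEARNING_RATES, CLASSIFIER_UPDATE_STEPS, NOISE_SCALES
    ):
        yield {
            "env": env_name,
            "batch_size": BATCH_SIZE,
            "learning_rate": lr,
            "classifier_update_steps": steps,
            "noise_scale": scale,
            "policy_value_hidden": list(sizes["policy_value_hidden"]),
            "classifier_hidden": list(sizes["classifier_hidden"]),
        }


def add_input_noise(features, noise_scale, rng):
    """Return features plus noise_scale * eps, with eps drawn from N(0, 1) per element."""
    return [x + noise_scale * rng.gauss(0.0, 1.0) for x in features]


if __name__ == "__main__":
    configs = list(search_grid("Reacher"))
    print(len(configs), "candidate configurations for Reacher")
    rng = random.Random(0)  # fixed seed so the illustration is repeatable
    print(add_input_noise([0.5, -1.0, 2.0], configs[0]["noise_scale"], rng))
```

Under the joint-grid assumption each environment has 4 × 3 × 3 = 36 candidate configurations. Methods that do not use the importance weight ρ would drop the `classifier_update_steps` axis.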