Offline Imitation Learning with Suboptimal Demonstrations via Relaxed Distribution Matching
Authors: Lantao Yu, Tianhe Yu, Jiaming Song, Willie Neiswanger, Stefano Ermon
AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive empirical study shows that our method significantly outperforms the best prior offline IL method in six standard continuous control environments with over 30% performance gain on average, across 22 settings where the imperfect dataset is highly suboptimal. |
| Researcher Affiliation | Collaboration | Lantao Yu\*¹, Tianhe Yu\*¹, Jiaming Song², Willie Neiswanger¹, Stefano Ermon¹ (¹Computer Science Department, Stanford University; ²NVIDIA, work done while at Stanford) |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statements or links indicating that source code for the methodology is openly available. |
| Open Datasets | Yes | We consider offline datasets of four MuJoCo (Todorov, Erez, and Tassa 2012) locomotion environments (hopper, halfcheetah, walker2d and ant) and two Adroit robotic manipulation environments (hammer and relocate) from the standard offline RL benchmark D4RL (Fu et al. 2020). |
| Dataset Splits | No | The paper describes the construction of datasets by mixing expert and random data from D4RL, but it does not specify explicit training, validation, and test splits (e.g., percentages or counts) for these combined datasets, so the data partitioning cannot be reproduced from the text alone. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper mentions using multilayer perceptron (MLP) networks and refers to 'gradient penalty (Gulrajani et al. 2017)' but does not provide specific version numbers for any software, libraries, or frameworks used in the implementation. |
| Experiment Setup | Yes | For all the tasks, we use α = 0.2 for RelaxDICE and use α = 0.05 for DemoDICE as suggested in (Kim et al. 2021), which is also verified in our experiments. We pick α and β for RelaxDICE-DRC via grid search, which we will discuss in the appendix. For more details of the experiment set-ups, evaluation protocols, hyperparameters and practical implementations, please see the appendix. |
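
Since the Dataset Splits row notes that the paper does not spell out how the imperfect datasets are partitioned, the following is a minimal sketch, not taken from the paper, of how expert and random D4RL transitions for an environment such as hopper could be mixed into a single imperfect dataset. The `gym`/`d4rl` calls are real APIs, but the dataset versions (`-v2`), transition counts, and mixing ratio are illustrative assumptions.

```python
import gym
import d4rl  # registers the D4RL environments with gym
import numpy as np

def load_transitions(env_name):
    """Load a D4RL dataset as a dict of aligned numpy arrays."""
    env = gym.make(env_name)
    return d4rl.qlearning_dataset(env)

def mix_datasets(expert, random_ds, num_expert, num_random, seed=0):
    """Concatenate a subsample of expert and random transitions into one dataset."""
    rng = np.random.default_rng(seed)
    # Sample transition indices once so all keys stay aligned
    e_idx = rng.choice(len(expert["observations"]), size=num_expert, replace=False)
    r_idx = rng.choice(len(random_ds["observations"]), size=num_random, replace=False)
    return {
        key: np.concatenate([expert[key][e_idx], random_ds[key][r_idx]], axis=0)
        for key in expert
    }

expert_data = load_transitions("hopper-expert-v2")
random_data = load_transitions("hopper-random-v2")
# Illustrative counts only: a highly suboptimal mixture with few expert transitions
imperfect_data = mix_datasets(expert_data, random_data,
                              num_expert=10_000, num_random=400_000)
```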
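The Software Dependencies row mentions that the discriminator uses MLP networks with a gradient penalty (Gulrajani et al. 2017) but gives no versions or code. Below is a minimal PyTorch sketch of that standard gradient-penalty term for a discriminator over concatenated (state, action) inputs; the penalty weight, network widths, and input dimensions (e.g., halfcheetah's 17-dim observations and 6-dim actions) are assumptions rather than values reported in the paper.

```python
import torch
import torch.nn as nn

def gradient_penalty(discriminator, expert_batch, imperfect_batch, weight=10.0):
    """WGAN-GP-style penalty on the discriminator's gradient norm at interpolated inputs."""
    eps = torch.rand(expert_batch.size(0), 1, device=expert_batch.device)
    interp = eps * expert_batch + (1.0 - eps) * imperfect_batch
    interp.requires_grad_(True)
    scores = discriminator(interp)
    grads = torch.autograd.grad(
        outputs=scores.sum(), inputs=interp, create_graph=True
    )[0]
    return weight * ((grads.norm(2, dim=1) - 1.0) ** 2).mean()

# Illustrative MLP discriminator over concatenated (state, action) vectors
discriminator = nn.Sequential(
    nn.Linear(17 + 6, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 1),
)
```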