Offline Imitation Learning with Suboptimal Demonstrations via Relaxed Distribution Matching

Authors: Lantao Yu, Tianhe Yu, Jiaming Song, Willie Neiswanger, Stefano Ermon

AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive empirical study shows that our method significantly outperforms the best prior offline IL method in six standard continuous control environments with over 30% performance gain on average, across 22 settings where the imperfect dataset is highly suboptimal.
Researcher Affiliation | Collaboration | Lantao Yu*¹, Tianhe Yu*¹, Jiaming Song², Willie Neiswanger¹, Stefano Ermon¹; ¹Computer Science Department, Stanford University; ²NVIDIA (Work done while at Stanford)
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statements or links indicating that source code for the methodology is openly available.
Open Datasets | Yes | We consider offline datasets of four MuJoCo (Todorov, Erez, and Tassa 2012) locomotion environments (hopper, halfcheetah, walker2d and ant) and two Adroit robotic manipulation environments (hammer and relocate) from the standard offline RL benchmark D4RL (Fu et al. 2020). (A hedged loading sketch is given below the table.)
Dataset Splits | No | The paper describes the construction of datasets by mixing expert and random data from D4RL, but it does not specify explicit training, validation, and test splits (e.g., percentages or counts) for these combined datasets that would allow the data partitioning to be reproduced. (One possible mixing scheme is sketched below the table.)
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies | No | The paper mentions using multilayer perceptron (MLP) networks and refers to 'gradient penalty (Gulrajani et al. 2017)' but does not provide specific version numbers for any software, libraries, or frameworks used in the implementation. (The standard form of the cited gradient penalty is sketched below the table.)
Experiment Setup | Yes | For all the tasks, we use α = 0.2 for RelaxDICE and use α = 0.05 for DemoDICE as suggested in (Kim et al. 2021), which is also verified in our experiments. We pick α and β for RelaxDICE-DRC via grid search, which we will discuss in the appendix. For more details of the experiment set-ups, evaluation protocols, hyperparameters and practical implementations, please see the appendix. (These settings are collected in a configuration sketch below the table.)
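
The Open Datasets row quotes the paper's use of D4RL locomotion and Adroit environments. Below is a minimal Python sketch of how such datasets are typically loaded with the d4rl package; the "-expert"/"-random" quality tags and the "-v2" version suffix are assumptions, since the exact dataset identifiers are not stated in this table.

```python
# Minimal sketch of loading D4RL datasets for the environments named above.
# The "-expert"/"-random" quality tags and the "-v2" version suffix are assumptions;
# the paper's exact dataset identifiers are not given in this table.
import gym
import d4rl  # registers the D4RL offline environments with gym

ENV_NAMES = ["hopper", "halfcheetah", "walker2d", "ant", "hammer", "relocate"]

def load_dataset(env_name: str, quality: str = "expert", version: str = "v2"):
    """Return a D4RL transition dataset as a dict of numpy arrays."""
    env = gym.make(f"{env_name}-{quality}-{version}")
    data = env.get_dataset()  # keys include observations, actions, rewards, terminals
    return data

if __name__ == "__main__":
    expert = load_dataset("hopper", "expert")
    print({k: v.shape for k, v in expert.items() if hasattr(v, "shape")})
```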
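The Dataset Splits row notes that the imperfect datasets are built by mixing expert and random D4RL data without stated proportions. The sketch below illustrates one generic way such a mixture could be assembled; the helper name mix_datasets and the transition counts are hypothetical placeholders, not the paper's construction.

```python
# Hedged sketch of one way to build an "imperfect" dataset by mixing expert and
# random D4RL transitions. The mixing ratio and transition counts below are
# illustrative placeholders, not the paper's actual construction.
import numpy as np

def mix_datasets(expert: dict, random_data: dict,
                 n_expert: int = 10_000, n_random: int = 990_000):
    """Concatenate the first n_expert expert transitions with n_random random ones."""
    keys = ["observations", "actions", "rewards", "terminals"]
    return {
        k: np.concatenate([expert[k][:n_expert], random_data[k][:n_random]], axis=0)
        for k in keys
    }
```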
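The Software Dependencies row mentions the gradient penalty of Gulrajani et al. (2017). The sketch below shows the standard form of that penalty in PyTorch, assuming a scalar-output discriminator; it is a generic illustration, not the paper's implementation.

```python
# Standard gradient-penalty term from Gulrajani et al. (2017), shown as a generic
# PyTorch sketch; the paper cites the technique but its exact implementation is not given.
import torch

def gradient_penalty(discriminator, real, fake, coeff: float = 10.0):
    """Penalize deviations of the discriminator's gradient norm from 1 on interpolates."""
    eps = torch.rand(real.size(0), 1, device=real.device)
    interp = (eps * real + (1.0 - eps) * fake).requires_grad_(True)
    scores = discriminator(interp)
    grads = torch.autograd.grad(outputs=scores.sum(), inputs=interp, create_graph=True)[0]
    return coeff * ((grads.norm(2, dim=1) - 1.0) ** 2).mean()
```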
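The Experiment Setup row fixes α = 0.2 for RelaxDICE and α = 0.05 for DemoDICE, and reports a grid search over α and β for RelaxDICE-DRC. The sketch below records those settings as a configuration; the candidate grids for RelaxDICE-DRC are hypothetical stand-ins, since the actual search space is deferred to the paper's appendix.

```python
# Hedged sketch of the hyperparameter settings quoted above. The alpha values for
# RelaxDICE and DemoDICE come from the paper's text; the RelaxDICE-DRC grids below
# are hypothetical placeholders standing in for the grid search described in the appendix.
from itertools import product

HYPERPARAMS = {
    "RelaxDICE": {"alpha": 0.2},
    "DemoDICE": {"alpha": 0.05},  # as suggested in Kim et al. (2021)
}

# Hypothetical candidate grids for RelaxDICE-DRC; the actual search space is in the appendix.
ALPHA_GRID = [0.05, 0.1, 0.2, 0.5]
BETA_GRID = [0.1, 0.5, 1.0]

RELAXDICE_DRC_GRID = [{"alpha": a, "beta": b} for a, b in product(ALPHA_GRID, BETA_GRID)]
```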