Agnostic Interactive Imitation Learning: New Theory and Practical Algorithms

Authors: Yichen Li, Chicheng Zhang

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Assessment

Research Type: Experimental
LLM Response: "Empirically, MFTPL-P and BOOTSTRAP-DAGGER notably surpass online and offline imitation learning baselines in continuous control tasks. Our experiments are designed to answer the following questions: Q1: Does sample-based perturbation provide any benefit in MFTPL-P? Q2: How does the choice of covering distribution d0 affect the performance of MFTPL-P? Q3: Does MFTPL-P outperform online and offline IL baselines? Q4: Can we find a practical variant of MFTPL-P that achieves similar performance to MFTPL-P without additional sample access to some covering distribution? Q5: If Q3 and Q4 are true, which component of our algorithms confers this advantage?"

Researcher Affiliation: Academia
LLM Response: "1 Department of Computer Science, University of Arizona, Tucson, AZ, USA."

Pseudocode: Yes
LLM Response: "Algorithm 1 MFTPL-P; Algorithm 3 MFTPL-P (Mixed Following The Perturbed Leader with Poisson Perturbations); Algorithm 4 BOOTSTRAP-DAGGER"

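The paper's pseudocode is not reproduced here. As a rough orientation only, the sketch below shows a DAgger-style interaction loop in which each ensemble member is refit either on the aggregated expert-labeled data plus a Poisson-sized batch of extra samples from a covering distribution d0 (in the spirit of MFTPL-P) or on a bootstrap resample of the aggregated data (in the spirit of BOOTSTRAP-DAGGER). The function names, the env/expert/oracle interfaces, and all details beyond this high-level loop are assumptions made for illustration, not the authors' Algorithms 1, 3, or 4.

```python
import numpy as np

def perturbed_dagger_sketch(env, expert, fit_policy, sample_d0, n_rounds,
                            n_ensemble, perturb_size, horizon,
                            use_bootstrap=False, seed=0):
    """Hedged sketch of an ensemble DAgger-style loop with data perturbation.

    Assumed (hypothetical) interfaces:
      fit_policy(states, actions) -> policy   # offline regression oracle
      sample_d0(k) -> (states, actions)       # k expert-labeled samples from d0
      expert(state) -> action                 # expert annotator
      env.reset() -> state; env.step(a) -> (state, done)
    """
    rng = np.random.default_rng(seed)
    data_s, data_a = [], []                    # aggregated expert-labeled data
    ensemble = [fit_policy(*sample_d0(1)) for _ in range(n_ensemble)]

    for _ in range(n_rounds):
        # Roll out the uniform mixture over ensemble members; label visited
        # states with expert actions and aggregate them (DAgger-style).
        s = env.reset()
        for _ in range(horizon):
            pi = ensemble[rng.integers(n_ensemble)]
            data_s.append(s)
            data_a.append(expert(s))
            s, done = env.step(pi(s))
            if done:
                break

        # Refit each ensemble member on a perturbed version of the data.
        new_ensemble = []
        for _ in range(n_ensemble):
            if use_bootstrap:
                # Bootstrap resample of the aggregated data (BOOTSTRAP-DAGGER spirit).
                idx = rng.integers(len(data_s), size=len(data_s))
                S, A = [data_s[i] for i in idx], [data_a[i] for i in idx]
            else:
                # Poisson-sized batch of extra samples from d0 (MFTPL-P spirit).
                k = int(rng.poisson(perturb_size))
                S0, A0 = sample_d0(k)
                S, A = data_s + list(S0), data_a + list(A0)
            new_ensemble.append(fit_policy(S, A))
        ensemble = new_ensemble
    return ensemble
```

In this simplified view the deployed policy is the uniform mixture over the refit ensemble; the paper's exact sampling scheme and guarantees are given in its Algorithms 1, 3, and 4 rather than here.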
Open Source Code: Yes
LLM Response: "For code and more information see https://github.com/liyichen1998/Bootstrap Dagger-MFTPLP"

Open Datasets: Yes
LLM Response: "Our experiments are designed to answer the following questions: Q1: Does sample-based perturbation provide any benefit in MFTPL-P? Q2: How does the choice of covering distribution d0 affect the performance of MFTPL-P? Q3: Does MFTPL-P outperform online and offline IL baselines? Q4: Can we find a practical variant of MFTPL-P that achieves similar performance to MFTPL-P without additional sample access to some covering distribution? Q5: If Q3 and Q4 are true, which component of our algorithms confers this advantage? We study the impact of perturbation size X and the choice of d0 on the performance of MP-25(X). Here, we choose DAGGER as the baseline; note that this is equivalent to MP-25(0), given that the offline learning oracle returns OLS solutions deterministically. We consider two settings of d0 in Section 5.1. We perform evaluations in realizable and non-realizable settings using MLPs as base policy classes. In the realizable setting, the base policy class contains the conditional mean function of the expert policy. Meanwhile, the non-realizable setting considers the base policy class to be MLPs with one hidden layer and limited numbers of nodes (see Appendix C.2 and C.4 for details)."

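To make the realizable vs. non-realizable distinction above concrete, the snippet below sketches two MLP base policy classes in PyTorch: a wider multi-layer network meant to be expressive enough to contain the expert's conditional mean function, and a single-hidden-layer network with few nodes. The layer widths, depths, and dimensions here are placeholders; the architectures actually used are specified in the paper's Appendix C.2 and C.4.

```python
import torch.nn as nn

def realizable_policy(obs_dim, act_dim, hidden=256):
    """Wider MLP, assumed expressive enough to contain the expert's
    conditional mean function (realizable setting)."""
    return nn.Sequential(
        nn.Linear(obs_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, act_dim),
    )

def nonrealizable_policy(obs_dim, act_dim, hidden=8):
    """One hidden layer with a limited number of nodes (non-realizable
    setting); the exact width is a placeholder, not the paper's value."""
    return nn.Sequential(
        nn.Linear(obs_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, act_dim),
    )
```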
Dataset Splits: No
LLM Response: "The paper does not provide explicit training/validation/test dataset splits. It discusses training models and evaluating them, but does not partition the data into such splits or report split percentages or counts."

Hardware Specification: Yes
LLM Response: "All experiments were conducted on an Ubuntu machine equipped with a 3.3 GHz Intel Core i9 CPU and 4 NVIDIA GeForce RTX 2080 Ti GPUs."

Software Dependencies: No
LLM Response: "The paper states: 'Our project is built upon the source code of Disagreement-Regularized Imitation Learning (https://github.com/xkianteb/dril) and shares the same environment dependencies.' It also names the operating system (Ubuntu) but does not give specific version numbers for software libraries or dependencies, which are needed for reproducibility."

Experiment Setup: Yes
LLM Response: "Table 3 (Hyperparameters for the Continuous Control Experiment) lists the values considered and the chosen value for each hyperparameter, including 'Learning Rate 2.5e-4', 'Batch Size 200', 'Train Epoch 2000', and 'Parallel Environments 25'."

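For convenience, the chosen values quoted above can be gathered into a small configuration object. Only the four values listed from Table 3 come from the paper; any additional fields (optimizer, environment names, evaluation seeds, and so on) would be assumptions and are omitted here.

```python
from dataclasses import dataclass

@dataclass
class ContinuousControlConfig:
    # Chosen values reported in Table 3 of the paper.
    learning_rate: float = 2.5e-4
    batch_size: int = 200
    train_epochs: int = 2000
    parallel_environments: int = 25

config = ContinuousControlConfig()
print(config)
```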