Agnostic Interactive Imitation Learning: New Theory and Practical Algorithms
Authors: Yichen Li, Chicheng Zhang
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, MFTPL-P and BOOTSTRAP-DAGGER notably surpass online and offline imitation learning baselines in continuous control tasks. Our experiments are designed to answer the following questions: Q1: Does sample-based perturbation provide any benefit in MFTPL-P? Q2: How does the choice of covering distribution d0 affect the performance of MFTPL-P? Q3: Does MFTPL-P outperform online and offline IL baselines? Q4: Can we find a practical variant of MFTPL-P that achieves similar performance to MFTPL-P without additional sample access to some covering distribution? Q5: If Q3 and Q4 are true, which component of our algorithms confers this advantage? |
| Researcher Affiliation | Academia | 1 Department of Computer Science, University of Arizona, Tucson, AZ, USA. |
| Pseudocode | Yes | Algorithm 1 MFTPL-P; Algorithm 3 MFTPL-P (Mixed Following The Perturbed Leader with Poisson Perturbations); Algorithm 4 BOOTSTRAP-DAGGER (an illustrative sketch of this style of loop appears after the table). |
| Open Source Code | Yes | For code and more information see https://github.com/liyichen1998/BootstrapDagger-MFTPLP |
| Open Datasets | Yes | We study the impact of perturbation size X and the choice of d0 on the performance of MP-25(X). Here, we choose DAGGER as the baseline; note that this is equivalent to MP-25(0), given that the offline learning oracle returns OLS solutions deterministically. We consider two settings of d0 in Section 5.1. We perform evaluations in realizable and non-realizable settings using MLPs as base policy classes. In the realizable setting, the base policy class contains the conditional mean function of the expert policy. Meanwhile, the non-realizable setting considers the base policy class to be MLPs with one hidden layer and limited numbers of nodes (see Appendix C.2 and C.4 for details; a placeholder policy-class sketch appears after the table). |
| Dataset Splits | No | The paper does not provide explicit training/test/validation dataset splits. It discusses training models and evaluating them, but not the partitioning of datasets into these specific splits with percentages or counts. |
| Hardware Specification | Yes | All experiments were conducted on an Ubuntu machine equipped with a 3.3 GHz Intel Core i9 CPU and 4 NVIDIA GeForce RTX 2080 Ti GPUs. |
| Software Dependencies | No | The paper mentions: "Our project is built upon the source code of Disagreement-Regularized Imitation Learning (https://github.com/xkianteb/dril) and shares the same environment dependencies." It also states the operating system (Ubuntu) but does not provide specific version numbers for software libraries or dependencies, which is required for reproducibility. |
| Experiment Setup | Yes | Table 3 (Hyperparameters for Continuous Control Experiment) lists, for each hyperparameter, the values considered and the chosen value, including 'Learning Rate 2.5e-4', 'Batch Size 200', 'Train Epoch 2000', and 'Parallel Environments 25' (collected into an illustrative config after the table). |
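
The Pseudocode row names MFTPL-P (Mixed Following The Perturbed Leader with Poisson Perturbations) and BOOTSTRAP-DAGGER, but the table does not reproduce the algorithms themselves. Below is a minimal, hypothetical sketch of the general shape of such a DAgger-style interactive loop with an ensemble of offline-oracle calls, assuming a classic Gym-style `env`, an `expert_action(obs)` query function, and an OLS offline oracle (the paper notes the oracle returns OLS solutions in its ablation baseline). The exact perturbation and aggregation rules of Algorithms 1, 3, and 4 are simplified here and should not be read as the authors' implementation.

```python
import numpy as np

def ols_oracle(X, Y):
    """Offline learning oracle: ordinary least squares (the paper's ablation
    baseline notes the oracle returns OLS solutions deterministically)."""
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W  # linear policy: action = obs @ W

def interactive_il_sketch(env, expert_action, n_rounds=10, ensemble_size=25,
                          horizon=200, seed=0):
    """Hypothetical DAgger-style loop with an ensemble of oracle calls.

    BOOTSTRAP-DAGGER (Algorithm 4) is described as a practical variant of
    MFTPL-P (Algorithm 3); here each ensemble member is refit on a bootstrap
    resample of the aggregated data. MFTPL-P would instead perturb each
    member's dataset with a Poisson-sized batch of samples drawn from a
    covering distribution d0 (detail omitted in this sketch).
    """
    rng = np.random.default_rng(seed)
    data_X, data_Y = [], []
    ensemble = [None] * ensemble_size

    def act(obs):
        if ensemble[0] is None:
            return expert_action(obs)            # round 0: follow the expert
        return np.mean([obs @ W for W in ensemble], axis=0)

    for _ in range(n_rounds):
        obs = env.reset()
        for _ in range(horizon):
            data_X.append(obs)
            data_Y.append(expert_action(obs))    # expert labels on visited states
            obs, _, done, _ = env.step(act(obs)) # classic Gym step API assumed
            if done:
                obs = env.reset()

        X, Y = np.asarray(data_X), np.asarray(data_Y)
        for i in range(ensemble_size):
            idx = rng.integers(0, len(X), size=len(X))   # bootstrap resample
            ensemble[i] = ols_oracle(X[idx], Y[idx])
    return ensemble
```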
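
The Open Datasets row distinguishes a realizable setting (the base policy class contains the expert's conditional mean) from a non-realizable setting (one-hidden-layer MLPs with limited width). A minimal placeholder, assuming PyTorch; the widths below are hypothetical and not the values from Appendix C.2 and C.4:

```python
import torch.nn as nn

def make_base_policy(obs_dim, act_dim, non_realizable=True):
    """One-hidden-layer MLP base policy class.

    In the non-realizable setting the hidden layer is kept deliberately small
    so the class cannot represent the expert's conditional mean.  Widths here
    are placeholders, not the paper's values.
    """
    width = 8 if non_realizable else 256
    return nn.Sequential(
        nn.Linear(obs_dim, width),
        nn.ReLU(),
        nn.Linear(width, act_dim),
    )
```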
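
The hyperparameters quoted from Table 3 can be collected into a single configuration for quick reference; the key names are illustrative, and only the values come from the paper:

```python
# Chosen values quoted from Table 3 (key names are illustrative).
CONTINUOUS_CONTROL_CONFIG = {
    "learning_rate": 2.5e-4,
    "batch_size": 200,
    "train_epochs": 2000,
    "parallel_environments": 25,
}
```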