Agnostic Interactive Imitation Learning: New Theory and Practical Algorithms

Authors: Yichen Li, Chicheng Zhang

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Assessment

Research Type: Experimental
LLM Response: "Empirically, MFTPL-P and BOOTSTRAP-DAGGER notably surpass online and offline imitation learning baselines in continuous control tasks. Our experiments are designed to answer the following questions: Q1: Does sample-based perturbation provide any benefit in MFTPL-P? Q2: How does the choice of covering distribution d0 affect the performance of MFTPL-P? Q3: Does MFTPL-P outperform online and offline IL baselines? Q4: Can we find a practical variant of MFTPL-P that achieves similar performance to MFTPL-P without additional sample access to some covering distribution? Q5: If Q3 and Q4 are true, which component of our algorithms confers this advantage?"

Researcher Affiliation: Academia
LLM Response: "1 Department of Computer Science, University of Arizona, Tucson, AZ, USA."

Pseudocode: Yes
LLM Response: "Algorithm 1 MFTPL-P; Algorithm 3 MFTPL-P (Mixed Following The Perturbed Leader with Poisson Perturbations); Algorithm 4 BOOTSTRAP-DAGGER"

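The paper's pseudocode is not reproduced here. As a rough orientation only, the sketch below shows a DAgger-style interaction loop in which each ensemble member is refit either on the aggregated expert-labeled data plus a Poisson-sized batch of extra samples from a covering distribution d0 (in the spirit of MFTPL-P) or on a bootstrap resample of the aggregated data (in the spirit of BOOTSTRAP-DAGGER). The function names, the env/expert/oracle interfaces, and all details beyond this high-level loop are assumptions made for illustration, not the authors' Algorithms 1, 3, or 4.

```python
import numpy as np

def perturbed_dagger_sketch(env, expert, fit_policy, sample_d0, n_rounds,
                            n_ensemble, perturb_size, horizon,
                            use_bootstrap=False, seed=0):
    """Hedged sketch of an ensemble DAgger-style loop with data perturbation.

    Assumed (hypothetical) interfaces:
      fit_policy(states, actions) -> policy   # offline regression oracle
      sample_d0(k) -> (states, actions)       # k expert-labeled samples from d0
      expert(state) -> action                 # expert annotator
      env.reset() -> state; env.step(a) -> (state, done)
    """
    rng = np.random.default_rng(seed)
    data_s, data_a = [], []                    # aggregated expert-labeled data
    ensemble = [fit_policy(*sample_d0(1)) for _ in range(n_ensemble)]

    for _ in range(n_rounds):
        # Roll out the uniform mixture over ensemble members; label visited
        # states with expert actions and aggregate them (DAgger-style).
        s = env.reset()
        for _ in range(horizon):
            pi = ensemble[rng.integers(n_ensemble)]
            data_s.append(s)
            data_a.append(expert(s))
            s, done = env.step(pi(s))
            if done:
                break

        # Refit each ensemble member on a perturbed version of the data.
        new_ensemble = []
        for _ in range(n_ensemble):
            if use_bootstrap:
                # Bootstrap resample of the aggregated data (BOOTSTRAP-DAGGER spirit).
                idx = rng.integers(len(data_s), size=len(data_s))
                S, A = [data_s[i] for i in idx], [data_a[i] for i in idx]
            else:
                # Poisson-sized batch of extra samples from d0 (MFTPL-P spirit).
                k = int(rng.poisson(perturb_size))
                S0, A0 = sample_d0(k)
                S, A = data_s + list(S0), data_a + list(A0)
            new_ensemble.append(fit_policy(S, A))
        ensemble = new_ensemble
    return ensemble
```

In this simplified view the deployed policy is the uniform mixture over the refit ensemble; the paper's exact sampling scheme and guarantees are given in its Algorithms 1, 3, and 4 rather than here.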
Open Source Code: Yes
LLM Response: "For code and more information see https://github.com/liyichen1998/Bootstrap Dagger-MFTPLP"

Open Datasets: Yes
LLM Response: "Our experiments are designed to answer the following questions: Q1: Does sample-based perturbation provide any benefit in MFTPL-P? Q2: How does the choice of covering distribution d0 affect the performance of MFTPL-P? Q3: Does MFTPL-P outperform online and offline IL baselines? Q4: Can we find a practical variant of MFTPL-P that achieves similar performance to MFTPL-P without additional sample access to some covering distribution? Q5: If Q3 and Q4 are true, which component of our algorithms confers this advantage? We study the impact of perturbation size X and the choice of d0 on the performance of MP-25(X). Here, we choose DAGGER as the baseline; note that this is equivalent to MP-25(0), given that the offline learning oracle returns OLS solutions deterministically. We consider two settings of d0 in Section 5.1. We perform evaluations in realizable and non-realizable settings using MLPs as base policy classes. In the realizable setting, the base policy class contains the conditional mean function of the expert policy. Meanwhile, the non-realizable setting considers the base policy class to be MLPs with one hidden layer and limited numbers of nodes (see Appendix C.2 and C.4 for details)."

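To make the realizable vs. non-realizable distinction above concrete, the snippet below sketches two MLP base policy classes in PyTorch: a wider multi-layer network meant to be expressive enough to contain the expert's conditional mean function, and a single-hidden-layer network with few nodes. The layer widths, depths, and dimensions here are placeholders; the architectures actually used are specified in the paper's Appendix C.2 and C.4.

```python
import torch.nn as nn

def realizable_policy(obs_dim, act_dim, hidden=256):
    """Wider MLP, assumed expressive enough to contain the expert's
    conditional mean function (realizable setting)."""
    return nn.Sequential(
        nn.Linear(obs_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, act_dim),
    )

def nonrealizable_policy(obs_dim, act_dim, hidden=8):
    """One hidden layer with a limited number of nodes (non-realizable
    setting); the exact width is a placeholder, not the paper's value."""
    return nn.Sequential(
        nn.Linear(obs_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, act_dim),
    )
```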
Dataset Splits: No
LLM Response: "The paper does not provide explicit training/validation/test dataset splits. It discusses training models and evaluating them, but does not partition the data into such splits or report split percentages or counts."

Hardware Specification: Yes
LLM Response: "All experiments were conducted on an Ubuntu machine equipped with a 3.3 GHz Intel Core i9 CPU and 4 NVIDIA GeForce RTX 2080 Ti GPUs."

Software Dependencies: No
LLM Response: "The paper states: 'Our project is built upon the source code of Disagreement-Regularized Imitation Learning (https://github.com/xkianteb/dril) and shares the same environment dependencies.' It also names the operating system (Ubuntu) but does not give specific version numbers for software libraries or dependencies, which are needed for reproducibility."

Experiment Setup: Yes
LLM Response: "Table 3 (Hyperparameters for the Continuous Control Experiment) lists the values considered and the chosen value for each hyperparameter, including 'Learning Rate 2.5e-4', 'Batch Size 200', 'Train Epoch 2000', and 'Parallel Environments 25'."

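For convenience, the chosen values quoted above can be gathered into a small configuration object. Only the four values listed from Table 3 come from the paper; any additional fields (optimizer, environment names, evaluation seeds, and so on) would be assumptions and are omitted here.

```python
from dataclasses import dataclass

@dataclass
class ContinuousControlConfig:
    # Chosen values reported in Table 3 of the paper.
    learning_rate: float = 2.5e-4
    batch_size: int = 200
    train_epochs: int = 2000
    parallel_environments: int = 25

config = ContinuousControlConfig()
print(config)
```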