Curriculum Offline Imitation Learning

Authors: Minghuan Liu, Hanye Zhao, Zhengyu Yang, Jian Shen, Weinan Zhang, Li Zhao, Tie-Yan Liu

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On continuous control benchmarks, we compare COIL against both imitation-based and RL-based methods, showing that it not only avoids just learning a mediocre behavior on mixed datasets but is also even competitive with state-of-the-art offline RL methods.
Researcher Affiliation | Collaboration | Minghuan Liu¹, Hanye Zhao¹, Zhengyu Yang¹, Jian Shen¹, Weinan Zhang¹, Li Zhao², Tie-Yan Liu²; ¹ Shanghai Jiao Tong University, ² Microsoft Research. {minghuanliu, fineartz, zyyang, rockyshen, wnzhang}@sjtu.edu.cn, {lizo, tyliu}@microsoft.com
Pseudocode | Yes | The step-by-step algorithm is shown in Algo. 1.
Open Source Code | Yes | Codes are available at https://github.com/apexrl/COIL.
Open Datasets | Yes | To further show the power of COIL, we conduct comparison experiments on the commonly used D4RL benchmark [5] in Tab. 2.
Dataset Splits | No | The paper refers to training iterations and 'online evaluation' but does not explicitly state training, validation, and test dataset splits with percentages or sample counts for reproducibility.
Hardware Specification | No | The paper does not explicitly describe the hardware used for experiments, such as specific CPU or GPU models or cloud computing resources with specifications.
Software Dependencies | No | The paper mentions using an 'open-source implementation' for baselines and 'our implementation of BC' but does not provide specific version numbers for software dependencies (e.g., Python, PyTorch, TensorFlow, CUDA).
Experiment Setup | Yes | It is worth noting that COIL has only two critical hyperparameters, namely, the number of selected trajectories N and the moving window of the return filter α, both of which can be determined by the property of the dataset. Specifically, N is related to the average discrepancy between the sampling policies in the dataset; α is influenced by the changes of the return of the trajectories contained in the dataset. In the ablation study Section 6.3 and Appendix E.2, we demonstrate how we select different hyperparameters for different datasets.
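
To make the role of the two hyperparameters concrete, below is a minimal, hypothetical Python sketch of a COIL-style curriculum selection step. It is not the authors' released implementation (https://github.com/apexrl/COIL): the Trajectory container, the policy.log_prob interface, and the exponential-smoothing update of the return filter are illustrative assumptions, and the paper's exact rules may differ.

```python
# Hedged sketch of a COIL-style curriculum step (assumed interfaces, not the
# reference implementation from https://github.com/apexrl/COIL).
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class Trajectory:
    states: np.ndarray    # (T, state_dim)
    actions: np.ndarray   # (T, action_dim)
    ret: float            # undiscounted trajectory return

def select_curriculum(policy, pool: List[Trajectory], n_select: int,
                      return_filter: float) -> List[Trajectory]:
    """Pick the n_select trajectories closest to the current policy
    (highest mean log-likelihood of the logged actions) whose return
    exceeds the current return filter."""
    candidates = [t for t in pool if t.ret >= return_filter]
    # policy.log_prob is an assumed interface returning per-step log-likelihoods.
    scores = [policy.log_prob(t.states, t.actions).mean() for t in candidates]
    order = np.argsort(scores)[::-1]            # most "imitable" first
    return [candidates[i] for i in order[:n_select]]

def update_return_filter(return_filter: float, selected: List[Trajectory],
                         alpha: float) -> float:
    """Moving-average update of the return filter with weight alpha
    (assumed exponential smoothing; the paper's exact update may differ)."""
    batch_return = float(np.mean([t.ret for t in selected]))
    return (1.0 - alpha) * return_filter + alpha * batch_return
```

In each outer iteration one would call select_curriculum, run behavior cloning on the returned trajectories, remove them from the pool, and update the return filter. N then controls how many near-policy trajectories are imitated per stage, while α controls how quickly the filter tracks the rising returns in the dataset, which is why the paper ties both to properties of the dataset itself.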