Curriculum Offline Imitation Learning
Authors: Minghuan Liu, Hanye Zhao, Zhengyu Yang, Jian Shen, Weinan Zhang, Li Zhao, Tie-Yan Liu
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On continuous control benchmarks, we compare COIL against both imitation-based and RL-based methods, showing that it not only avoids just learning a mediocre behavior on mixed datasets but is also even competitive with state-of-the-art offline RL methods. |
| Researcher Affiliation | Collaboration | Minghuan Liu1 Hanye Zhao1 Zhengyu Yang1 Jian Shen1 Weinan Zhang1 Li Zhao2 Tie-Yan Liu2 1 Shanghai Jiao Tong University, 2 Microsoft Research {minghuanliu, fineartz, zyyang, rockyshen, wnzhang}@sjtu.edu.cn, {lizo,tyliu}@microsoft.com |
| Pseudocode | Yes | The step-by-step algorithm is shown in Algo. 1. |
| Open Source Code | Yes | Codes are available at https://github.com/apexrl/COIL. |
| Open Datasets | Yes | To further show the power of COIL, we conduct comparison experiments on a common-used D4RL benchmark [5] in Tab. 2. |
| Dataset Splits | No | The paper refers to training iterations and 'online evaluation' but does not explicitly state training, validation, and test dataset splits with percentages or sample counts for reproducibility. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used for experiments, such as specific CPU or GPU models, or cloud computing resources with specifications. |
| Software Dependencies | No | The paper mentions using 'open-source implementation' for baselines and 'our implementation of BC' but does not provide specific version numbers for software dependencies (e.g., Python, PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | It is worth noting that COIL has only two critical hyperparameters, namely, the number of selected trajectories N and the moving window of the return filter α, both of which can be determined by the property of the dataset. Specifically, N is related to the average discrepancy between the sampling policies in the dataset; α is influenced by the changes of the return of the trajectories contained in the dataset. In the ablation study Section 6.3 and Appendix E.2, we demonstrate how we select different hyperparameters for different datasets. |
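The setup row above describes COIL's two critical hyperparameters: the number of selected trajectories N and the moving window α of the return filter. As a reading aid, the sketch below illustrates one plausible interpretation of that curriculum step: pick the N trajectories the current policy assigns the highest likelihood, among those whose return clears a floor, then move the floor toward the returns just imitated. The function names, the dictionary-based trajectory format, and the exponential-style filter update are all assumptions for illustration, not the paper's actual implementation (see the authors' code at https://github.com/apexrl/COIL for the real algorithm).

```python
def select_curriculum(trajectories, log_prob_fn, n_select, return_floor):
    """Hypothetical sketch of a COIL-style selection step: keep trajectories
    whose return clears the filter, then take the n_select trajectories the
    current policy is most likely to have generated (a proxy for closeness
    between the behavior policy and the learner)."""
    eligible = [t for t in trajectories if t["return"] >= return_floor]
    eligible.sort(key=log_prob_fn, reverse=True)
    return eligible[:n_select]


def update_return_filter(return_floor, selected, alpha):
    """Hypothetical moving-window return filter: nudge the floor toward the
    mean return of the trajectories just imitated, weighted by alpha."""
    if not selected:
        return return_floor
    mean_ret = sum(t["return"] for t in selected) / len(selected)
    return (1 - alpha) * return_floor + alpha * mean_ret


# Toy dataset: trajectories summarized by id and episode return.
trajs = [
    {"id": 0, "return": 2.0},
    {"id": 1, "return": 5.0},
    {"id": 2, "return": 8.0},
    {"id": 3, "return": 1.0},
]

# Stand-in likelihood: pretend the current policy is closest to
# mid-return behavior (a real implementation would score state-action
# pairs under the learned policy).
log_prob = lambda t: -abs(t["return"] - 5.0)

picked = select_curriculum(trajs, log_prob, n_select=2, return_floor=2.0)
print([t["id"] for t in picked])                      # → [1, 0]
print(update_return_filter(2.0, picked, alpha=0.5))   # → 2.75
```

Under this reading, N trades off how far each curriculum stage may stray from the current policy, while α controls how aggressively the return floor chases the best behavior seen so far, which matches the paper's note that both can be tuned from properties of the dataset.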