CEIL: Generalized Contextual Imitation Learning
Authors: Jinxin Liu, Li He, Yachen Kang, Zifeng Zhuang, Donglin Wang, Huazhe Xu
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we evaluate CEIL on the popular MuJoCo tasks (online) and the D4RL dataset (offline). Compared to prior state-of-the-art baselines, we show that CEIL is more sample-efficient in most online IL tasks and achieves better or competitive performances in offline tasks. |
| Researcher Affiliation | Collaboration | Jinxin Liu (1,2), Li He (1), Yachen Kang (1,2), Zifeng Zhuang (1,2), Donglin Wang (1,4), Huazhe Xu (3,5,6); 1: Westlake University, 2: Zhejiang University, 3: Tsinghua University, 4: Westlake Institute for Advanced Study, 5: Shanghai Qi Zhi Institute, 6: Shanghai AI Lab |
| Pseudocode | Yes | Algorithm 1 Training CEIL: Online or Offline IL Setting |
| Open Source Code | Yes | Our code will be released at https://github.com/wechto/Generalized CEIL. |
| Open Datasets | Yes | Our experiments are conducted in four popular MuJoCo environments: Hopper-v2 (Hop.), HalfCheetah-v2 (Hal.), Walker2d-v2 (Wal.), and Ant-v2 (Ant.). In the single-domain IL setting, we train a SAC policy in each environment and use the learned expert policy to collect expert trajectories (demonstrations/observations). To investigate the cross-domain IL setting, we assume the two domains (the learning MDP and the expert-data-collecting MDP) have the same state space and action space, while they have different transition dynamics. To achieve this, we modify the torso length of the MuJoCo agents (see details in Appendix 9.2). Then, for each modified agent, we train a separate expert policy and collect expert trajectories. For the offline IL setting, we directly take the reward-free D4RL [22] as the offline dataset, replacing the online rollout experience in the online IL setting. (A hedged sketch of this data setup follows the table.) |
| Dataset Splits | No | The paper states that it uses D4RL datasets (medium, medium-replay, medium-expert) and samples trajectories directly from the given offline data. However, it does not provide explicit percentages or counts for train/validation/test splits, nor does it point to predefined splits. For online IL, it mentions using an experience replay buffer but no explicit splitting. |
| Hardware Specification | No | The paper does not specify any hardware details such as specific GPU/CPU models, processor types, memory amounts, or cloud computing resources used for running the experiments. |
| Software Dependencies | No | The paper mentions using “publicly available rlkit implementation of SAC” and “default Pytorch scheduler” but does not provide specific version numbers for these software components or any other libraries used. |
| Experiment Setup | Yes | In Table 9, we list the hyperparameters used in the experiments. For the size of the embedding dictionary, we selected it from the range [512, 1024, 2048, 4096]; 4096 attained good performance almost uniformly across IL tasks, so we chose it as the default. For the size of the embedding dimension, we tried four values [4, 8, 16, 32] and selected 16 as the default. For the trajectory window size, we tried five values [2, 4, 8, 16, 32] but did not observe a significant difference in performance across them, so we selected 2 as the default. For the learning-rate scheduler, we tried the default PyTorch scheduler and Cosine Annealing Warm Restarts, and found that Cosine Annealing Warm Restarts yields better results (so we selected it). Other hyperparameters are consistent with the default values of most RL implementations, e.g., a learning rate of 3e-4 and an MLP policy. (Table 9, page 14 provides specific parameter values; a hedged configuration sketch follows the table.) |
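
The rows above quote the paper's data setup: online expert rollouts collected with a trained SAC policy in MuJoCo, and reward-free D4RL data in the offline setting. Below is a minimal sketch of that data pipeline, not the authors' released code: `expert_policy` is a hypothetical callable standing in for the learned SAC expert, the classic `gym` step API (obs, reward, done, info) is assumed, and no CEIL-specific component is reproduced.

```python
import gym
import numpy as np
import d4rl  # importing d4rl registers the offline datasets' environments with gym


def collect_expert_trajectories(env_name, expert_policy, n_trajs=10):
    """Roll out a pre-trained expert to gather demonstrations (online IL setting).

    `expert_policy` is a hypothetical callable: observation -> action.
    Rewards are discarded, since the imitation setting is reward-free.
    """
    env = gym.make(env_name)  # e.g. "Hopper-v2"
    trajectories = []
    for _ in range(n_trajs):
        obs, done = env.reset(), False
        traj = {"observations": [], "actions": []}
        while not done:
            action = expert_policy(obs)
            traj["observations"].append(obs)
            traj["actions"].append(action)
            obs, _, done, _ = env.step(action)  # reward is ignored
        trajectories.append({k: np.array(v) for k, v in traj.items()})
    return trajectories


def load_reward_free_d4rl(dataset_name):
    """Load a D4RL dataset and drop the reward signal (offline IL setting)."""
    env = gym.make(dataset_name)  # e.g. "hopper-medium-replay-v2"
    data = env.get_dataset()
    return {k: v for k, v in data.items() if k != "rewards"}
```

Dropping the `rewards` key mirrors the paper's use of D4RL as a reward-free offline dataset; everything else (trajectory windows, context embeddings) belongs to CEIL itself and is omitted here.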
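
The experiment-setup row reports a handful of default hyperparameters: an embedding dictionary of size 4096, an embedding dimension of 16, a trajectory window of 2, a learning rate of 3e-4, an MLP policy, and the Cosine Annealing Warm Restarts scheduler. The sketch below collects these into a PyTorch configuration; it is a hedged reconstruction, not Table 9 of the paper: the MLP widths, the use of Adam, the Hopper-v2 dimensions, and the restart period `T_0` are assumptions.

```python
import torch
import torch.nn as nn

# Values taken from the quoted description; everything else is assumed.
config = {
    "embedding_dict_size": 4096,  # selected from [512, 1024, 2048, 4096]
    "embedding_dim": 16,          # selected from [4, 8, 16, 32]
    "window_size": 2,             # selected from [2, 4, 8, 16, 32]
    "learning_rate": 3e-4,
}

obs_dim, act_dim = 11, 3  # Hopper-v2 dimensions (illustrative choice)

# Generic MLP policy conditioned on the context embedding (widths are assumptions).
policy = nn.Sequential(
    nn.Linear(obs_dim + config["embedding_dim"], 256),
    nn.ReLU(),
    nn.Linear(256, 256),
    nn.ReLU(),
    nn.Linear(256, act_dim),
)

optimizer = torch.optim.Adam(policy.parameters(), lr=config["learning_rate"])
# The paper reports better results with Cosine Annealing Warm Restarts;
# T_0 (steps until the first restart) is a placeholder, not a reported value.
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=1000)

# During training, scheduler.step() is called after each optimizer.step().
```

The learnable embedding dictionary (4096 entries of dimension 16) would typically be an `nn.Embedding(4096, 16)`; how CEIL queries and trains it is part of the method itself and is not reproduced here.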