Policy Contrastive Imitation Learning

Authors: Jialei Huang, Zhao-Heng Yin, Yingdong Hu, Yang Gao

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Furthermore, our empirical evaluation on the DeepMind Control suite demonstrates that PCIL can achieve state-of-the-art performance. Finally, qualitative results suggest that PCIL builds a smoother and more meaningful representation space for imitation learning.
Researcher Affiliation | Academia | 1 Department of IIIS, Tsinghua University, Beijing, China; 2 Shanghai Artificial Intelligence Laboratory, Shanghai, China; 3 Shanghai Qi Zhi Institute, Shanghai, China; 4 Hong Kong University of Science and Technology, Hong Kong, China. Correspondence to: Yang Gao <gaoyangiiis@mail.tsinghua.edu.cn>.
Pseudocode | No | The paper describes the proposed algorithm using mathematical equations and natural language, but it does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | We will release our code and data.
Open Datasets | Yes | We experiment with 10 MuJoCo (Todorov et al., 2012) tasks provided by the DeepMind Control Suite (Tassa et al., 2018), a widely used benchmark for continuous control.
Dataset Splits | No | The paper describes its experimental setup, including environment steps, batch sizes, and hyperparameters, but it does not specify a distinct validation dataset split with percentages or counts. Training is performed with an online RL approach using replay buffers rather than static dataset splits.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory specifications) used to run its experiments.
Software Dependencies | No | The paper mentions several algorithms and frameworks (e.g., DrQ-v2, DDPG, clipped double Q-learning, DPG, the Adam optimizer), but it does not specify exact version numbers for any software libraries or dependencies, such as Python, PyTorch, or TensorFlow.
Experiment Setup | Yes | Training Details: To update the encoder, we randomly sample 128 expert transitions and 128 agent transitions from a replay buffer. For an arbitrary expert transition, any other expert transition is considered a positive sample, and all the agent transitions constitute the set of negative samples. We update the encoder by minimizing Equation 1 with respect to these samples. We use DrQ-v2 (Yarats et al., 2021) as the underlying RL algorithm to train the agent with the cosine-similarity reward given in Equation 2. We use a budget of 2M environment steps for all experiments. Further implementation details can be found in Appendix B. Table 3 lists the hyperparameters used for all baseline methods and our method. The expert data ratio in PCIL is the ratio between expert data and batch size; a ratio of 0.5 means that half of the batch is expert data and the other half is agent data. Contrastive learning usually applies temperature scaling to the cosine similarity before the exponential; for simplicity, we omit it in the main text. In the experiments, we follow prior contrastive learning work (He et al., 2020) and use a typical value of 0.07 for the temperature.
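To make the quoted training description concrete, below is a minimal PyTorch sketch of how the contrastive encoder update (Equation 1) and the cosine-similarity reward (Equation 2) could look, based only on the details given here: 128 expert and 128 agent transitions per update, other expert embeddings as positives, agent embeddings as negatives, cosine similarity, and a temperature of 0.07. The InfoNCE-style form of the loss, the averaging over positives, the reward as a mean cosine similarity to expert embeddings, and the names contrastive_encoder_loss / pcil_reward are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F


def contrastive_encoder_loss(expert_feats, agent_feats, temperature=0.07):
    """Sketch of the encoder objective described in the text (Equation 1).

    expert_feats: (N_e, d) embeddings of expert transitions (N_e = 128 in the paper).
    agent_feats:  (N_a, d) embeddings of agent transitions  (N_a = 128 in the paper).
    For each expert embedding, every other expert embedding is a positive and
    all agent embeddings are negatives; similarities are cosine similarities
    scaled by a temperature of 0.07 (He et al., 2020). The InfoNCE-style form
    below is an assumed reading of Equation 1, not the paper's exact loss.
    """
    expert = F.normalize(expert_feats, dim=-1)
    agent = F.normalize(agent_feats, dim=-1)

    pos_sim = expert @ expert.t() / temperature   # (N_e, N_e) expert-expert cosine sims
    neg_sim = expert @ agent.t() / temperature    # (N_e, N_a) expert-agent cosine sims

    n = expert.shape[0]
    off_diag = ~torch.eye(n, dtype=torch.bool, device=expert.device)

    losses = []
    for i in range(n):
        positives = pos_sim[i][off_diag[i]]            # drop self-similarity
        logits = torch.cat([positives, neg_sim[i]])    # positives followed by negatives
        log_prob = logits - torch.logsumexp(logits, dim=0)
        # Average the contrastive term over all positives for this anchor.
        losses.append(-log_prob[: positives.shape[0]].mean())
    return torch.stack(losses).mean()


def pcil_reward(agent_feat, expert_feats):
    """Assumed form of the cosine-similarity reward (Equation 2): mean cosine
    similarity between one agent transition embedding and the expert embeddings."""
    agent = F.normalize(agent_feat, dim=-1)
    expert = F.normalize(expert_feats, dim=-1)
    return (agent @ expert.t()).mean()


if __name__ == "__main__":
    # Shapes follow the quoted batch composition (128 expert + 128 agent transitions);
    # the embedding size 64 is arbitrary for this example.
    expert_feats = torch.randn(128, 64)
    agent_feats = torch.randn(128, 64)
    print(contrastive_encoder_loss(expert_feats, agent_feats))
    print(pcil_reward(agent_feats[0], expert_feats))
```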