Policy Contrastive Imitation Learning

Authors: Jialei Huang, Zhao-Heng Yin, Yingdong Hu, Yang Gao

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Furthermore, our empirical evaluation on the DeepMind Control suite demonstrates that PCIL can achieve state-of-the-art performance. Finally, qualitative results suggest that PCIL builds a smoother and more meaningful representation space for imitation learning.
Researcher Affiliation | Academia | 1 Department of IIIS, Tsinghua University, Beijing, China; 2 Shanghai Artificial Intelligence Laboratory, Shanghai, China; 3 Shanghai Qi Zhi Institute, Shanghai, China; 4 Hong Kong University of Science and Technology, Hong Kong, China. Correspondence to: Yang Gao <gaoyangiiis@mail.tsinghua.edu.cn>.
Pseudocode | No | The paper describes the proposed algorithm using mathematical equations and natural language, but it does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | We will release our code and data.
Open Datasets | Yes | We experiment with 10 MuJoCo (Todorov et al., 2012) tasks provided by the DeepMind Control Suite (Tassa et al., 2018), a widely used benchmark for continuous control.
Dataset Splits | No | The paper describes its experimental setup, including environment steps, batch sizes, and hyperparameters, but it does not specify a distinct validation dataset split with percentages or counts. Training is performed with an online RL approach using replay buffers rather than static dataset splits.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory specifications) used to run its experiments.
Software Dependencies | No | The paper mentions several algorithms and frameworks (e.g., DrQ-v2, DDPG, clipped double Q-learning, DPG, the Adam optimizer), but it does not specify exact version numbers for any software libraries or dependencies, such as Python, PyTorch, or TensorFlow.
Experiment Setup | Yes | Training Details: To update the encoder, we randomly sample 128 expert transitions and 128 agent transitions from a replay buffer. For an arbitrary expert transition, any other expert transition is considered a positive sample, and all the agent transitions constitute the set of negative samples. We update the encoder by minimizing Equation 1 with respect to these samples. We use DrQ-v2 (Yarats et al., 2021) as the underlying RL algorithm to train the agent with the cosine-similarity reward given in Equation 2. We use a budget of 2M environment steps for all experiments. Further implementation details can be found in Appendix B. Table 3 lists the hyperparameters used for all baseline methods and our method. The expert data ratio in PCIL is the ratio between expert data and batch size; a ratio of 0.5 means that half of the batch is expert data and the other half is agent data. Contrastive learning usually applies temperature scaling to the cosine similarity before the exponential; for simplicity, we omit it in the main text. In the experiments, we follow prior contrastive learning work (He et al., 2020) and use a typical value of 0.07 for the temperature.
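To make the quoted training description concrete, below is a minimal PyTorch sketch of how the contrastive encoder update (Equation 1) and the cosine-similarity reward (Equation 2) could look, based only on the details given here: 128 expert and 128 agent transitions per update, other expert embeddings as positives, agent embeddings as negatives, cosine similarity, and a temperature of 0.07. The InfoNCE-style form of the loss, the averaging over positives, the reward as a mean cosine similarity to expert embeddings, and the names contrastive_encoder_loss / pcil_reward are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F


def contrastive_encoder_loss(expert_feats, agent_feats, temperature=0.07):
    """Sketch of the encoder objective described in the text (Equation 1).

    expert_feats: (N_e, d) embeddings of expert transitions (N_e = 128 in the paper).
    agent_feats:  (N_a, d) embeddings of agent transitions  (N_a = 128 in the paper).
    For each expert embedding, every other expert embedding is a positive and
    all agent embeddings are negatives; similarities are cosine similarities
    scaled by a temperature of 0.07 (He et al., 2020). The InfoNCE-style form
    below is an assumed reading of Equation 1, not the paper's exact loss.
    """
    expert = F.normalize(expert_feats, dim=-1)
    agent = F.normalize(agent_feats, dim=-1)

    pos_sim = expert @ expert.t() / temperature   # (N_e, N_e) expert-expert cosine sims
    neg_sim = expert @ agent.t() / temperature    # (N_e, N_a) expert-agent cosine sims

    n = expert.shape[0]
    off_diag = ~torch.eye(n, dtype=torch.bool, device=expert.device)

    losses = []
    for i in range(n):
        positives = pos_sim[i][off_diag[i]]            # drop self-similarity
        logits = torch.cat([positives, neg_sim[i]])    # positives followed by negatives
        log_prob = logits - torch.logsumexp(logits, dim=0)
        # Average the contrastive term over all positives for this anchor.
        losses.append(-log_prob[: positives.shape[0]].mean())
    return torch.stack(losses).mean()


def pcil_reward(agent_feat, expert_feats):
    """Assumed form of the cosine-similarity reward (Equation 2): mean cosine
    similarity between one agent transition embedding and the expert embeddings."""
    agent = F.normalize(agent_feat, dim=-1)
    expert = F.normalize(expert_feats, dim=-1)
    return (agent @ expert.t()).mean()


if __name__ == "__main__":
    # Shapes follow the quoted batch composition (128 expert + 128 agent transitions);
    # the embedding size 64 is arbitrary for this example.
    expert_feats = torch.randn(128, 64)
    agent_feats = torch.randn(128, 64)
    print(contrastive_encoder_loss(expert_feats, agent_feats))
    print(pcil_reward(agent_feats[0], expert_feats))
```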