Imitation Learning from Purified Demonstrations
Authors: Yunke Wang, Minjing Dong, Yukun Zhao, Bo Du, Chang Xu
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results on MuJoCo and RoboSuite demonstrate the effectiveness of our method from different aspects. In this section, we conduct extensive experiments to verify the effectiveness of DP-IL in MuJoCo (Todorov et al., 2012) and RoboSuite (Zhu et al., 2020) with different compared methods. The experimental results demonstrate the advantage of DP-IL from different aspects. |
| Researcher Affiliation | Academia | 1School of Computer Science, National Engineering Research Center for Multimedia Software, Institute of Artificial Intelligence and Wuhan Institute of Data Intelligence, Wuhan University, China. 2Department of Computer Science, City University of Hong Kong, China. 3School of Computer Science, Faculty of Engineering, The University of Sydney, Australia. |
| Pseudocode | Yes | The pseudocode of the diffusion model's training and purification is available in Algorithm 1 and Algorithm 2. (A hedged sketch of a possible purification loop appears after the table.) |
| Open Source Code | Yes | Our source code and training data will be available at https://github.com/yunke-wang/dp-il. |
| Open Datasets | Yes | We first conduct experiments on MuJoCo benchmarks in OpenAI Gym (Brockman et al., 2016). We also evaluate the robustness of DP-BC on the RoboSuite platform (Zhu et al., 2020) with real-world demonstrations. We use real-world demonstrations by human operators from RoboTurk (Mandlekar et al., 2018). |
| Dataset Splits | No | The paper uses optimal and sub-optimal demonstrations for training but does not explicitly state specific training/validation/test dataset splits, percentages, or sample counts needed for reproduction. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions implementing algorithms based on a DDPM repository and using TRPO, but it does not provide specific version numbers for software components, libraries, or frameworks used in the experiments. |
| Experiment Setup | Yes | The training epoch is set to be 10000 and the learning rate of ϵϕ is set to 1e-4. We set N = 1000 for all experiments and set the forward process variances to constants increasing linearly from β1 = 1e-4 to βN = 0.02. ...the policy is trained with batch size 256, and the total epoch is set to be 1000. For online imitation learning, the learning rate of the discriminator Dψ and the critic rψ is set to 3e-4. ...The discount rate γ of the sampled trajectory is set to 0.995. The τ (GAE parameter) is set to 0.97. (A sketch of the quoted diffusion schedule appears after the table.) |
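
For reference, the quoted hyperparameters (N = 1000, forward-process variances increasing linearly from β1 = 1e-4 to βN = 0.02) correspond to a standard DDPM linear schedule. Below is a minimal PyTorch sketch of that schedule and the closed-form forward sampling step, assuming the usual DDPM parameterization; the function and variable names are illustrative and not taken from the paper's released code.

```python
import torch

# Hyperparameters quoted in the Experiment Setup row above.
N = 1000                                   # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, N)      # forward-process variances, linear in t
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # cumulative products used by q(x_t | x_0)

def diffuse(x0: torch.Tensor, n: int) -> torch.Tensor:
    """Sample x_n ~ q(x_n | x_0) in one shot via the standard DDPM identity:
    x_n = sqrt(alpha_bar_n) * x_0 + sqrt(1 - alpha_bar_n) * eps, eps ~ N(0, I).
    Here n is 1-based, matching the β1..βN notation in the quote."""
    noise = torch.randn_like(x0)
    return alpha_bars[n - 1].sqrt() * x0 + (1.0 - alpha_bars[n - 1]).sqrt() * noise
```

In the purification setting the paper describes, `diffuse` would presumably be applied to (potentially imperfect) demonstration samples for a limited number of steps before denoising them back.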
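The purification step itself (Algorithm 2 in the paper) is not reproduced here. Under the assumption that it follows standard DDPM ancestral sampling from a partially diffused demonstration, a denoising loop might look like the sketch below; `eps_model` is a hypothetical stand-in for the trained noise predictor ϵϕ, and the actual algorithm may differ.

```python
import torch

# Same linear schedule as in the previous sketch.
N = 1000
betas = torch.linspace(1e-4, 0.02, N)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def purify(x_n: torch.Tensor, n: int, eps_model) -> torch.Tensor:
    """Standard DDPM ancestral sampling from step n down to 0.

    eps_model(x_t, t) is a hypothetical interface for the trained noise
    predictor; this is a sketch, not the paper's Algorithm 2.
    """
    x = x_n
    for t in range(n, 0, -1):                                  # t is 1-based
        z = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
        eps = eps_model(x, torch.full(x.shape[:1], t))
        coef = betas[t - 1] / (1.0 - alpha_bars[t - 1]).sqrt()
        x = (x - coef * eps) / alphas[t - 1].sqrt() + betas[t - 1].sqrt() * z
    return x
```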