Demonstration-Conditioned Reinforcement Learning for Few-Shot Imitation

Authors: Christopher R. Dance, Julien Perez, Théo Cachet

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We present results on robotic manipulation and navigation benchmarks, demonstrating DCRL's superior performance compared with state-of-the-art alternatives, as well as its ability to improve on suboptimal demonstrations and to cope with domain shifts."
Researcher Affiliation | Industry | "NAVER LABS Europe, 6 chemin de Maupertuis, Meylan, 38240, France. Website: europe.naverlabs.com. Correspondence to: Théo Cachet <theo.cachet@naverlabs.com>."
Pseudocode | Yes | "Algorithm 1: Demonstration-conditioned reinforcement learning"
Open Source Code | No | The paper cites the Meta-World source code for the benchmark used (Yu et al., 2019c), but it provides no statement or link indicating that the authors' own DCRL implementation is open-source or publicly available.
Open Datasets | Yes | "We use Meta-World, a robotic manipulation benchmark, originally designed to assess the performance of meta-learning algorithms (Yu et al., 2019b). ... Our second benchmark involves 60 tasks, each corresponding to a maze layout. ... the transition function is computed with ViZDoom (Kempka et al., 2016)."
Dataset Splits | No | The paper describes train/test splits over tasks (e.g., "trained on 45 tasks and tested on 5 hold-out tasks" for Meta-World; "train on a fixed set of 50 mazes and test on the remaining 10 mazes" for Navigation), but the main text gives no separate validation split for hyperparameter tuning or early stopping.
Hardware Specification | Yes | "It takes about one day to train DCRL for both benchmarks, using a Tesla V100 GPU. ... On this benchmark, using an Nvidia 2080 Ti GPU, the execution time of our transformer-based architecture is as follows."
Software Dependencies | No | The paper mentions PPO, Meta-World, MuJoCo, and ViZDoom but does not specify version numbers for any of these dependencies, which are necessary for full reproducibility.
Experiment Setup | No | The main text describes parts of the training process (e.g., 5000 demonstrations sampled per task, 250 million environment frames, PPO) but states that "Full details can be found in the Supplementary Material" regarding hyperparameters, so specific values such as learning rates and batch sizes are not provided in the main text.
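The splits quoted for the Dataset Splits variable (45 train / 5 hold-out Meta-World tasks; 50 train / 10 test mazes) are partitions over tasks rather than over samples, which is what makes a separate validation split notable by its absence. A minimal sketch of such a task-level hold-out partition, with hypothetical task names (the function is illustrative, not from the paper):

```python
import random

def split_tasks(tasks, n_holdout, seed=0):
    """Partition a task list into (train, hold-out) sets.

    Mirrors the task-level splits described in the paper:
    45 train / 5 hold-out for Meta-World, 50 / 10 for Navigation.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(tasks)
    rng.shuffle(shuffled)
    return shuffled[n_holdout:], shuffled[:n_holdout]

# Example: 50 Meta-World-style tasks split 45 / 5.
meta_world_tasks = [f"task_{i}" for i in range(50)]
train_tasks, holdout_tasks = split_tasks(meta_world_tasks, n_holdout=5)
```

Note that every task lands in exactly one of the two sets; a validation split for hyperparameter tuning would require carving a third disjoint subset out of the training tasks.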
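The paper's Algorithm 1 is not reproduced in this assessment, but the quoted details (a transformer-based architecture whose policy is conditioned on demonstrations and trained with PPO) imply a policy interface that takes both the current state and a set of demonstrations. The sketch below is purely illustrative: the class, the mean-pooled linear encoder standing in for the transformer, and the greedy action rule are all assumptions, not the paper's implementation.

```python
import numpy as np

class DemoConditionedPolicy:
    """Hypothetical interface for a demonstration-conditioned policy.

    DCRL conditions the policy on one or more demonstrations; here a
    mean-pooled linear encoder stands in for the paper's transformer,
    and random weights stand in for PPO-trained parameters.
    """

    def __init__(self, state_dim, demo_step_dim, action_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W_demo = rng.normal(size=(demo_step_dim, 8)) * 0.1
        self.W_pi = rng.normal(size=(state_dim + 8, action_dim)) * 0.1

    def encode_demos(self, demos):
        # demos: list of (T_i, demo_step_dim) arrays; pool over all steps
        # of all demonstrations into a single fixed-size embedding.
        steps = np.concatenate(demos, axis=0)
        return np.tanh(steps @ self.W_demo).mean(axis=0)

    def act(self, state, demos):
        # The action depends on both the state and the demonstrations.
        z = self.encode_demos(demos)
        logits = np.concatenate([state, z]) @ self.W_pi
        return int(np.argmax(logits))  # greedy action for the sketch

policy = DemoConditionedPolicy(state_dim=4, demo_step_dim=6, action_dim=3)
demos = [np.ones((5, 6)), np.zeros((3, 6))]  # two toy demonstrations
action = policy.act(np.zeros(4), demos)
```

The key design point this interface captures is few-shot imitation: the same trained policy can be pointed at a new task simply by passing it that task's demonstrations, with no gradient updates at test time.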