Demonstration-Conditioned Reinforcement Learning for Few-Shot Imitation
Authors: Christopher R. Dance, Julien Perez, Théo Cachet
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present results on robotic manipulation and navigation benchmarks, demonstrating DCRL's superior performance compared with state-of-the-art alternatives, as well as its ability to improve on suboptimal demonstrations and to cope with domain shifts. |
| Researcher Affiliation | Industry | 1NAVER LABS Europe, 6 chemin de Maupertuis, Meylan, 38240, France. Website:europe.naverlabs.com. Correspondence to: Théo Cachet <theo.cachet@naverlabs.com>. |
| Pseudocode | Yes | Algorithm 1 Demonstration-conditioned reinforcement learning |
| Open Source Code | No | The paper cites 'Meta-World source code' for the benchmark used (Yu et al., 2019c), but it does not provide any statement or link indicating that the authors' own DCRL implementation code is open-source or publicly available. |
| Open Datasets | Yes | We use Meta-World, a robotic manipulation benchmark, originally designed to assess the performance of metalearning algorithms (Yu et al., 2019b). ... Our second benchmark involves 60 tasks, each corresponding to a maze layout. ... the transition function is computed with Viz Doom (Kempka et al., 2016). |
| Dataset Splits | No | The paper describes training and testing splits over tasks (e.g., 'trained on 45 tasks and tested on 5 hold-out tasks' for Meta-World, and 'train on a fixed set of 50 mazes and test on the remaining 10 mazes' for navigation), but the main text does not describe a separate validation split for hyperparameter tuning or early stopping. |
| Hardware Specification | Yes | It takes about one day to train DCRL for both benchmarks, using a Tesla V100 GPU. ... On this benchmark, using an Nvidia 2080 Ti GPU, the execution time of our transformer-based architecture is as follows. |
| Software Dependencies | No | The paper mentions using PPO, Meta-World, MuJoCo, and Viz Doom but does not specify the version numbers for any of these software dependencies, which are necessary for full reproducibility. |
| Experiment Setup | No | While the paper describes aspects of the training process (e.g., sampling 5000 demonstrations per task, training for 250 million environment frames, using PPO), it states that 'Full details can be found in the Supplementary Material' regarding hyperparameters, so specific setup values such as learning rates and batch sizes are not provided in the main text. |
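
For readers assessing the Pseudocode row above: the overall shape of demonstration-conditioned RL (Algorithm 1) is a policy that receives both the current observation and an encoding of one or more demonstrations, trained with PPO across tasks. The sketch below is a toy illustration of that loop only; all names (`encode_demos`, `toy_env_step`, `dcrl_rollout`) and the trivial encoder/environment are illustrative stand-ins, not the paper's implementation, and the PPO update itself is omitted.

```python
import random

def encode_demos(demos):
    # Stand-in for the paper's transformer demonstration encoder: here we
    # just summarise demonstrations (lists of (obs, action, reward) triples)
    # by the mean reward they contain.
    rewards = [r for demo in demos for (_, _, r) in demo]
    return sum(rewards) / len(rewards) if rewards else 0.0

def policy(obs, demo_context):
    # Demonstration-conditioned policy: the action depends on both the
    # current observation and the encoded demonstrations.
    return 1 if obs + demo_context > 0 else 0

def toy_env_step(obs, action):
    # Toy stand-in for a benchmark environment such as Meta-World
    # manipulation or maze navigation.
    reward = 1.0 if action == 1 else 0.0
    next_obs = obs + random.uniform(-1.0, 1.0)
    return next_obs, reward

def dcrl_rollout(demos, horizon=10, obs=0.0):
    """Collect one episode with the demonstration-conditioned policy.

    In DCRL, batches of such rollouts (conditioned on demonstrations
    sampled per task) would feed a PPO update, which is omitted here.
    """
    context = encode_demos(demos)
    total_reward = 0.0
    for _ in range(horizon):
        action = policy(obs, context)
        obs, reward = toy_env_step(obs, action)
        total_reward += reward
    return total_reward
```

Because the policy conditions on demonstrations rather than being fine-tuned per task, few-shot imitation at test time amounts to running `dcrl_rollout` with fresh demonstrations from a hold-out task, with no gradient steps.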