Choreographer: Learning and Adapting Skills in Imagination
Authors: Pietro Mazzaglia, Tim Verbelen, Bart Dhoedt, Alexandre Lacoste, Sai Rajeswar
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically evaluate Choreographer to answer the following questions: Does Choreographer learn and adapt skills effectively for unsupervised RL? (Section 4.1) We use the URL benchmark (Laskin et al., 2021) to show that, after pre-training on exploration data, our agent can adapt to several tasks in a data-efficient manner. We show this holds for both state and pixel inputs and in both offline and online settings. |
| Researcher Affiliation | Collaboration | 1 Ghent University - imec; 2 ServiceNow Research |
| Pseudocode | Yes | We reported in the appendix the detailed pseudo-code for both Choreographer (Algorithm 2) and code resampling (Algorithm 1). (A hedged sketch of code resampling is given after the table.) |
| Open Source Code | Yes | Project website: https://skillchoreographer.github.io/ [...] The code is publicly available through the project website. |
| Open Datasets | Yes | We use the URL benchmark (Laskin et al., 2021) to show that, after pre-training on exploration data, our agent can adapt to several tasks in a data-efficient manner. [...] In the URLB, there are three domains from the DM Control Suite (Tassa et al., 2018), Walker, Quadruped, and Jaco, with four downstream tasks to solve per each domain. [...] We zoom in on the Jaco sparse tasks from URLB and use sparse goal-reaching tasks from Meta-World (Yu et al., 2019) to evaluate the ability of our agent to find sparse rewards in the environment, by leveraging its skills. |
| Dataset Splits | Yes | The experimental setup consists of two phases: a longer data collection/pre-training phase, where the agent has up to 2M environment steps to interact with the environment without rewards, and a shorter fine-tuning phase, where the agent has 100k steps to interact with a task-specific version of the environment, where it should both discover rewards and solve a downstream task. (This two-phase protocol is sketched after the table.) |
| Hardware Specification | Yes | In the URLB paper, they claim a training time of about 12 hours for 2M frames pre-training using V100 GPUs for their model-free agents. Using the same GPUs, we can pre-train a Choreographer agent in about 44 hours. |
| Software Dependencies | No | The paper mentions software such as 'DreamerV2' and 'Adam' but does not specify version numbers for these or other key software dependencies (e.g., 'Python 3.8', 'PyTorch 1.9', 'CUDA 11.1'). |
| Experiment Setup | Yes | Implementation details and hyperparameters are provided in Appendix B and the open-source code is available on the project website. [...] We use Adam with learning rate 3 × 10⁻⁴ for the updates, and clipping gradients norm to 100. [...] The actor and critic networks are instantiated as 4-layer MLP with a dimensionality of 400 and updated using Adam with a learning rate of 8 × 10⁻⁵ and gradients clipped to 100. [...] The codebook consists of 64 codes of dimensionality 16. Code resampling happens every M = 200 training batches. (These settings are collected in the configuration sketch after the table.) |
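The paper's Algorithm 1 covers code resampling, but only its existence and schedule are quoted above. The following is a minimal sketch, assuming the common criterion of reinitializing codes that went unused since the last resampling step to randomly chosen encoder outputs (a standard remedy for codebook collapse); the class name, usage tracking, and exact replacement rule are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ResampledCodebook(nn.Module):
    """VQ-style codebook with periodic resampling of unused codes (sketch)."""

    def __init__(self, n_codes=64, code_dim=16, resample_every=200):
        super().__init__()
        self.codes = nn.Parameter(torch.randn(n_codes, code_dim))
        self.usage = torch.zeros(n_codes)   # selections since last resample
        self.resample_every = resample_every
        self.batches_seen = 0

    def quantize(self, z):
        # Nearest-code lookup for a batch of encoder outputs z: (B, code_dim).
        idx = torch.cdist(z, self.codes).argmin(dim=1)
        self.usage += torch.bincount(idx, minlength=len(self.codes)).float()
        self.batches_seen += 1
        if self.batches_seen % self.resample_every == 0:
            self._resample(z)
        return self.codes[idx], idx

    @torch.no_grad()
    def _resample(self, z):
        # Assumed rule: replace codes never selected in the last window with
        # random encoder outputs drawn from the current batch.
        for i in (self.usage == 0).nonzero(as_tuple=True)[0]:
            self.codes[i] = z[torch.randint(len(z), (1,))].squeeze(0)
        self.usage.zero_()
```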
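For concreteness, here is a schematic of the two-phase URLB protocol described in the Dataset Splits row. The step budgets, domains, and task counts are the values quoted above; the agent interface (`explore_and_train`, `finetune`) is a hypothetical placeholder.

```python
# Step budgets and domains are the quoted values; everything else is
# an illustrative placeholder.
URLB_PROTOCOL = {
    "pretrain_env_steps": 2_000_000,   # reward-free pre-training budget
    "finetune_env_steps": 100_000,     # per-task fine-tuning budget
    "domains": ["walker", "quadruped", "jaco"],   # DM Control Suite domains
    "tasks_per_domain": 4,
}

def run_urlb(agent, reward_free_env, task_env):
    """Two-phase evaluation loop; `agent` methods are hypothetical."""
    for _ in range(URLB_PROTOCOL["pretrain_env_steps"]):
        agent.explore_and_train(reward_free_env)   # no task reward observed
    for _ in range(URLB_PROTOCOL["finetune_env_steps"]):
        agent.finetune(task_env)                   # discover reward, solve task
```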
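Finally, the Experiment Setup values can be collected into one configuration sketch. The learning rates, gradient-clipping norm, MLP depth/width, and codebook shape are the reported numbers; the feature and action dimensions, module layout, and variable names are assumptions made to keep the snippet self-contained.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=400, depth=4):
    """4-layer MLP with dimensionality 400, as reported for actor and critic."""
    layers, d = [], in_dim
    for _ in range(depth):
        layers += [nn.Linear(d, hidden), nn.ELU()]
        d = hidden
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

feat_dim, act_dim = 230, 6                 # placeholder sizes (assumption)
world_model = mlp(feat_dim, feat_dim)      # stand-in for the world model
actor, critic = mlp(feat_dim, act_dim), mlp(feat_dim, 1)
skill_codebook = nn.Embedding(64, 16)      # 64 codes of dimensionality 16

wm_opt     = torch.optim.Adam(world_model.parameters(), lr=3e-4)  # 3 × 10⁻⁴
actor_opt  = torch.optim.Adam(actor.parameters(),       lr=8e-5)  # 8 × 10⁻⁵
critic_opt = torch.optim.Adam(critic.parameters(),      lr=8e-5)

# Gradient norms are clipped to 100 before each optimizer step:
loss = critic(torch.randn(8, feat_dim)).pow(2).mean()  # dummy loss for the demo
loss.backward()
torch.nn.utils.clip_grad_norm_(critic.parameters(), max_norm=100.0)
critic_opt.step()
```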