Learning transferable motor skills with hierarchical latent mixture policies
Authors: Dushyant Rao, Fereshteh Sadeghi, Leonard Hasenclever, Markus Wulfmeier, Martina Zambelli, Giulia Vezzani, Dhruva Tirumala, Yusuf Aytar, Josh Merel, Nicolas Heess, Raia Hadsell
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate in manipulation domains that the method can effectively cluster offline data into distinct, executable behaviours, while retaining the flexibility of a continuous latent variable model. The resulting skills can be transferred and fine-tuned on new tasks, unseen objects, and from state to vision-based policies, yielding better sample efficiency and asymptotic performance compared to existing skill- and imitation-based methods. We further analyse how and when the skills are most beneficial: they encourage directed exploration to cover large regions of the state space relevant to the task, making them most effective in challenging sparse-reward settings. |
| Researcher Affiliation | Industry | Dushyant Rao, Fereshteh Sadeghi, Leonard Hasenclever, Markus Wulfmeier, Martina Zambelli, Giulia Vezzani, Dhruva Tirumala, Yusuf Aytar, Josh Merel, Nicolas Heess, & Raia Hadsell, DeepMind, London, UK |
| Pseudocode | No | No pseudocode or clearly labeled algorithm blocks were found in the paper. The methodology is described through text, graphical models, and mathematical equations. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code or a link to a code repository. |
| Open Datasets | Yes | We focus on manipulation tasks, using a MuJoCo-based environment with a single Sawyer arm, and three objects coloured red, green, and blue. We follow the challenging object stacking benchmark of Lee et al. (2021), which specifies five object sets (Figure 2)... To evaluate our approach and baselines in the manipulation settings, we use two datasets: red_on_blue_stacking: this data is collected by an agent trained to stack the red object on the blue object and ignore the green one, for the simplest object set, set4. all_pairs_stacking: similar to the previous case, but with all six pairwise stacking combinations of {red, green, blue}, and covering all of the five object sets. |
| Dataset Splits | No | The paper mentions using 'red_on_blue_stacking' and 'all_pairs_stacking' datasets for offline learning and subsequent transfer experiments, but does not provide specific percentages or counts for training, validation, and test splits. It implies using the full dataset for offline learning and then evaluating on various transfer scenarios without explicitly defining traditional splits. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. It mentions using a 'Mu Jo Co-based environment' but not the underlying hardware that runs the simulation or experiments. |
| Software Dependencies | No | The paper mentions software components such as 'MPO' (Maximum a posteriori Policy Optimisation), 'RHPO' (Regularized Hierarchical Policy Optimization), 'MuJoCo' (a physics engine), and 'ResNet' (a neural network architecture), but does not provide specific version numbers for any of them, which is required for reproducible software dependency information. |
| Experiment Setup | Yes | The network architecture details and hyperparameters for HeLMS are shown in Table 5. Parameter sweeps were performed for the β coefficients during offline learning and the η coefficients during RL. Small sweeps were also performed for the RHPO ϵ parameters... All RL experiments were run with 3 seeds to capture variation in each method. For network architectures, all experiments except for vision used simple 2-layer MLPs... Table 5: Hyperparameters and architecture details for HeLMS, for both offline training and RL. (See the sketch after this table.) |
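
The Experiment Setup row reports a concrete training configuration: 2-layer MLP networks for non-vision experiments, sweeps over the offline-learning β coefficients and the RL-stage η coefficients, small sweeps over the RHPO ϵ parameters, and 3 seeds per RL run. The sketch below is a minimal, hypothetical way to encode that configuration when attempting a reproduction; `HeLMSExperimentConfig`, `sweep_runs`, and every numeric value are assumptions for illustration only, not the authors' settings (which are listed in Table 5 of the paper).

```python
# Minimal, hypothetical sketch (not the authors' code) of the experiment setup
# described in the paper. All field names and numeric values are assumptions;
# the actual hyperparameters are given in Table 5 of the paper.
from dataclasses import dataclass
from typing import Dict, Iterator, Tuple


@dataclass(frozen=True)
class HeLMSExperimentConfig:
    # Non-vision experiments reportedly use simple 2-layer MLPs (widths assumed here).
    mlp_hidden_sizes: Tuple[int, ...] = (256, 256)
    # Offline-learning beta coefficients were swept (sweep values assumed).
    beta_sweep: Tuple[float, ...] = (0.1, 1.0, 10.0)
    # RL-stage eta coefficients were swept (sweep values assumed).
    eta_sweep: Tuple[float, ...] = (0.01, 0.1, 1.0)
    # Small sweeps were also performed for the RHPO epsilon parameters (values assumed).
    rhpo_epsilons: Tuple[float, ...] = (1e-3, 1e-2)
    # All RL experiments were run with 3 seeds.
    seeds: Tuple[int, ...] = (0, 1, 2)


def sweep_runs(cfg: HeLMSExperimentConfig) -> Iterator[Dict[str, float]]:
    """Enumerate (beta, eta, seed) combinations for a simple grid sweep."""
    for beta in cfg.beta_sweep:
        for eta in cfg.eta_sweep:
            for seed in cfg.seeds:
                yield {"beta": beta, "eta": eta, "seed": seed}


if __name__ == "__main__":
    cfg = HeLMSExperimentConfig()
    runs = list(sweep_runs(cfg))
    print(f"Grid sweep defines {len(runs)} runs across "
          f"{len(cfg.beta_sweep)} betas, {len(cfg.eta_sweep)} etas, {len(cfg.seeds)} seeds.")
```

A structured config like this makes the reported sweep ranges and seed counts explicit, which is the main information a reproduction would need beyond the architecture details in Table 5.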