MCP: Learning Composable Hierarchical Control with Multiplicative Compositional Policies

Authors: Xue Bin Peng, Michael Chang, Grace Zhang, Pieter Abbeel, Sergey Levine

NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that MCP is able to extract composable skills for highly complex simulated characters from pre-training tasks, such as motion imitation, and then reuse these skills to solve challenging continuous control tasks, such as dribbling a soccer ball to a goal, and picking up an object and transporting it to a target location.
Researcher Affiliation | Academia | Xue Bin Peng, Michael Chang, Grace Zhang, Pieter Abbeel, Sergey Levine. Department of Electrical Engineering and Computer Science, University of California, Berkeley. {xbpeng, mbchang, grace.zhang}@berkeley.edu, pabbeel@cs.berkeley.edu, svlevine@eecs.berkeley.edu
Pseudocode | Yes | Algorithm 1: MCP Pre-Training and Transfer
Open Source Code | No | Supplementary video: xbpeng.github.io/projects/MCP/. This URL points to a project/video page, not a direct code repository link. The paper does not explicitly state that source code for the methodology is released or available via a code repository.
Open Datasets | Yes | We use a motion imitation approach following Peng et al. [32]... The corpus of motion clips is comprised of different walking and turning motions. The environment is a variant of the standard Gym Ant environment [4]. SFU Motion Capture Database, http://mocap.cs.sfu.ca/ [38].
Dataset Splits | No | No specific training/validation/test dataset splits (exact percentages, sample counts, or citations to predefined splits) are provided for the experiments, and cross-validation is not mentioned.
Hardware Specification | No | We would like to thank AWS, Google, and NVIDIA for providing computational resources. This acknowledgement is too general and does not name specific hardware (e.g., GPU or CPU models, or cloud instance types).
Software Dependencies | No | The policies operate at 30Hz and are trained using proximal policy optimization (PPO) [37]. No specific software dependencies with version numbers (e.g., Python, PyTorch/TensorFlow, CUDA) are explicitly provided.
Experiment Setup | Yes | All experiments use a similar network architecture for the policy, as illustrated in Figure 3. Each policy is composed of k = 8 primitives. The gating function and primitives are modeled by separate networks that output w(s, g), µ1:k(s), and Σ1:k(s), which are then composed according to Equation 2 to produce the composite policy. The state describes the configuration of the character's body, with features consisting of the relative positions of each link with respect to the root, their rotations represented by quaternions, and their linear and angular velocities. Actions from the policy specify target rotations for PD controllers positioned at each joint. Target rotations for 3D spherical joints are parameterized using exponential maps. The policies operate at 30Hz and are trained using proximal policy optimization (PPO) [37].
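
For readers checking the reported setup against an implementation: Equation 2 of the paper forms the composite policy by multiplying the k primitive Gaussians, each raised to its gating weight, and renormalizing; for diagonal Gaussians that product is again a Gaussian with precision-weighted mean and variance. Below is a minimal NumPy sketch of that composition. The function name, the precision-weighted parameterization, and the toy inputs are illustrative assumptions rather than the authors' code, so the exact per-dimension form used in the experiments should be checked against the paper.

```python
import numpy as np

def mcp_composite_gaussian(mu, sigma, w):
    """Multiplicatively compose k diagonal-Gaussian primitives (sketch).

    mu:    (k, d) primitive means mu_i(s)
    sigma: (k, d) primitive standard deviations sigma_i(s)
    w:     (k,)   non-negative gating weights w_i(s, g)

    Returns the mean and std of the composite Gaussian proportional to
    prod_i N(mu_i, sigma_i^2)^{w_i}, computed independently per action dimension.
    """
    w = w[:, None]                      # broadcast weights over action dimensions
    precision = w / (sigma ** 2)        # weighted precisions w_i / sigma_i^2
    comp_var = 1.0 / precision.sum(axis=0)
    comp_mu = comp_var * (precision * mu).sum(axis=0)
    return comp_mu, np.sqrt(comp_var)

# Toy usage with k = 8 primitives (as in the paper) and a 3-dimensional action.
rng = np.random.default_rng(0)
k, d = 8, 3
mu = rng.normal(size=(k, d))            # stand-ins for the primitive network outputs
sigma = np.exp(rng.normal(scale=0.2, size=(k, d)))
w = rng.random(k)                       # stand-in for the gating network output w(s, g)
a_mu, a_sigma = mcp_composite_gaussian(mu, sigma, w)
action = rng.normal(a_mu, a_sigma)      # sample an action from the composite policy
```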