Composing Complex Skills by Learning Transition Policies

Authors: Youngwoon Lee*, Shao-Hua Sun*, Sriram Somasundaram, Edward S. Hu, Joseph J. Lim

ICLR 2019

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | The proposed method is evaluated on a set of complex continuous control tasks in bipedal locomotion and robotic arm manipulation, which traditional policy gradient methods struggle with. "We conducted experiments on two classes of continuous control tasks: robotic manipulation and locomotion." |
| Researcher Affiliation | Academia | University of Southern California ({lee504,shaohuas,sriramso,hues,limjj}@usc.edu) |
| Pseudocode | Yes | Algorithm 1 (TRAIN) and Algorithm 2 (ROLLOUT) provide structured pseudocode blocks. |
| Open Source Code | Yes | "We make our environments, primitive skills, and code public for further research at https://youngwoon.github.io/transition." |
| Open Datasets | No | The paper describes simulated environments (MuJoCo physics engine) and custom-trained primitive skills. While these environments and code are made public, the paper does not refer to or provide access information for a pre-existing, publicly available dataset in the conventional sense (e.g., CIFAR-10, ImageNet). |
| Dataset Splits | No | The paper does not specify explicit training, validation, or test dataset splits (e.g., percentages or counts) for any data used in the experiments. In reinforcement learning, policies typically learn directly from environment interaction rather than from pre-split datasets. |
| Hardware Specification | No | The paper states that environments are simulated in the MuJoCo physics engine but does not specify hardware details such as GPU models, CPU types, or memory used for running the experiments or simulations. |
| Software Dependencies | No | The paper mentions using OpenAI Baselines (Dhariwal et al., 2017) for the TRPO and PPO implementations and the Adam optimizer, but does not provide version numbers for these software components or any other libraries/dependencies. |
| Experiment Setup | Yes | Table 3, titled "Hyperparameter values for transition policy, proximity predictor, and primitive policy as well as TRPO and PPO baselines", lists specific values for learning rate, mini-batch size, and number of mini-batches. Section B.1 also states, "For all networks, we use the Adam optimizer with mini-batch size of 64. We use 4 workers for rollout and parameter update. The size of rollout for each update is 10,000 steps. We limit the maximum length of a transition trajectory as 100." A minimal configuration sketch based on these reported values appears below. |
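The values quoted in the Experiment Setup row are concrete enough to collect into a single configuration object. The following is a minimal, hypothetical Python sketch: the `TrainingConfig` class and its field names are illustrative and are not taken from the authors' released code; only the values in the comments come from the paper's Section B.1.

```python
from dataclasses import dataclass


@dataclass
class TrainingConfig:
    """Hypothetical container for the training setup reported in Section B.1.

    Only the field values below are stated in the paper; the class itself and
    its field names are illustrative, not taken from the released code.
    """

    optimizer: str = "adam"                   # "we use the Adam optimizer"
    minibatch_size: int = 64                  # "mini-batch size of 64"
    num_rollout_workers: int = 4              # "4 workers for rollout and parameter update"
    rollout_steps_per_update: int = 10_000    # "size of rollout for each update is 10,000 steps"
    max_transition_trajectory_len: int = 100  # "maximum length of a transition trajectory as 100"


if __name__ == "__main__":
    # Print the configuration so the reported values are easy to inspect.
    print(TrainingConfig())
```

Values that are only listed in the paper's Table 3 and not quoted here (e.g., the learning rate and number of mini-batches) would have to be filled in from the paper or the released code rather than from this sketch.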