Composing Complex Skills by Learning Transition Policies
Authors: Youngwoon Lee*, Shao-Hua Sun*, Sriram Somasundaram, Edward S. Hu, Joseph J. Lim
ICLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The proposed method is evaluated on a set of complex continuous control tasks in bipedal locomotion and robotic arm manipulation which traditional policy gradient methods struggle at. We conducted experiments on two classes of continuous control tasks: robotic manipulation and locomotion. |
| Researcher Affiliation | Academia | University of Southern California {lee504,shaohuas,sriramso,hues,limjj}@usc.edu |
| Pseudocode | Yes | Algorithm 1 (TRAIN) and Algorithm 2 (ROLLOUT) provide structured pseudocode blocks; a hedged sketch of the rollout logic appears after this table. |
| Open Source Code | Yes | We make our environments, primitive skills, and code public for further research at https://youngwoon.github.io/transition. |
| Open Datasets | No | The paper describes simulated environments (MuJoCo physics engine) and custom-trained primitive skills. While these environments and code are made public, the paper does not refer to or provide access information for a pre-existing, publicly available dataset in the conventional sense (e.g., CIFAR-10, ImageNet). |
| Dataset Splits | No | The paper does not specify explicit training, validation, or test dataset splits (e.g., percentages or counts) for any data used in the experiments. In reinforcement learning, policies are typically trained through direct interaction with the environment rather than on pre-split datasets. |
| Hardware Specification | No | The paper states that environments are simulated in the MuJoCo physics engine but does not specify any hardware details such as GPU models, CPU types, or memory used for running the experiments or simulations. |
| Software Dependencies | No | The paper mentions using "OpenAI Baselines (Dhariwal et al., 2017)" for the TRPO and PPO implementations and the "Adam optimizer" but does not provide specific version numbers for these software components or any other libraries/dependencies. |
| Experiment Setup | Yes | Table 3, titled "Hyperparameter values for transition policy, proximity predictor, and primitive policy as well as TRPO and PPO baselines", lists specific values for learning rate, mini-batch size, and number of mini-batches. Section B.1 also states, "For all networks, we use the Adam optimizer with mini-batch size of 64. We use 4 workers for rollout and parameter update. The size of rollout for each update is 10,000 steps. We limit the maximum length of a transition trajectory as 100." A hedged configuration sketch based on these quoted values follows this table. |
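
As a reading aid for the Pseudocode row, the following is a minimal sketch of how a rollout that chains primitive skills through transition policies might be structured. Every name here (`env`, `primitive_policies`, `transition_policies`, `act`, `is_terminated`) is an assumption, not taken from the authors' code; only the 100-step cap on transition trajectories comes from the quoted Section B.1. This is not the paper's Algorithm 2, which should be consulted for the actual procedure.

```python
# Illustrative sketch only: chaining primitive skills with transition policies
# during a rollout. All object and method names are assumed, not taken from the
# authors' code; see Algorithm 2 (ROLLOUT) in the paper for the real procedure.
MAX_TRANSITION_STEPS = 100  # transition trajectories are capped at 100 steps (Section B.1)


def rollout(env, primitive_policies, transition_policies, max_primitive_steps=1000):
    """Execute primitive skills in sequence, bridging consecutive skills
    with a learned transition policy."""
    obs = env.reset()
    trajectory = []
    done = False
    for i, primitive in enumerate(primitive_policies):
        # Before every primitive except the first, run the corresponding
        # transition policy to steer toward a state from which the next
        # primitive is likely to succeed.
        if i > 0:
            transition = transition_policies[i - 1]
            for _ in range(MAX_TRANSITION_STEPS):
                action = transition.act(obs)
                obs, reward, done, info = env.step(action)
                trajectory.append((obs, action, reward))
                if done:
                    return trajectory
        # Execute the primitive skill until it terminates or the episode ends.
        for _ in range(max_primitive_steps):
            action = primitive.act(obs)
            obs, reward, done, info = env.step(action)
            trajectory.append((obs, action, reward))
            if done or primitive.is_terminated(obs):
                break
        if done:
            break
    return trajectory
```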
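
The hyperparameters quoted in the Experiment Setup row (Adam optimizer, mini-batch size 64, 4 rollout workers, 10,000-step rollouts per update, 100-step transition cap) can be collected into a single configuration object, as sketched below. The class and field names are hypothetical, and values not quoted in this report (e.g., the learning rates listed in Table 3 of the paper) are deliberately left unset.

```python
# A minimal configuration sketch assembled from the values quoted in the report
# (Section B.1 and Table 3 of the paper). The class and field names are
# hypothetical; only the numeric values cited in the comments are sourced.
from dataclasses import dataclass
from typing import Optional


@dataclass
class TrainingConfig:
    optimizer: str = "adam"                  # "we use the Adam optimizer" (Section B.1)
    mini_batch_size: int = 64                # "mini-batch size of 64"
    num_rollout_workers: int = 4             # "4 workers for rollout and parameter update"
    rollout_steps_per_update: int = 10_000   # "size of rollout for each update is 10,000 steps"
    max_transition_length: int = 100         # "maximum length of a transition trajectory as 100"
    learning_rate: Optional[float] = None    # listed in Table 3 of the paper; not quoted in this report


config = TrainingConfig()
print(config)
```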