Training Transition Policies via Distribution Matching for Complex Tasks

Authors: Ju-Seung Byun, Andrew Perrault

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate our method on continuous bipedal locomotion and arm manipulation tasks that require diverse skills. We show that it smoothly connects the lower-level policies, achieving higher success rates than previous methods that search for successful trajectories based on a reward function but do not match the state distribution. We evaluate our method on six continuous bipedal locomotion and arm manipulation environments created by Lee et al. (2019): Repetitive picking up, Repetitive catching, Serve, Patrol, Hurdle, and Obstacle course. All of the environments are simulated with the MuJoCo physics engine (Todorov et al., 2012). The tasks are detailed in Section 5.1 and Section 5.2. Section 5.3 shows that our method performs better than using a single policy or simply using pre-trained policies, is comparable to Lee et al. (2019) for the arm manipulation tasks, and is much stronger for the locomotion tasks. In Section 5.4, we visualize how a transition policy trained through IRL resembles the distribution of the pre-trained policies.
Researcher Affiliation | Academia | Ju-Seung Byun & Andrew Perrault, Department of Computer Science & Engineering, The Ohio State University, Columbus, OH 43210, USA, {byun.83,perrault.17}@osu.edu
Pseudocode | Yes | Algorithm 1 (Training Transition Policy π_{a,b}) and Algorithm 2 (Training Deep Q Network q_{a,b})
Open Source Code | Yes | The source code is available at https://github.com/shashacks/IRL_Transition.
Open Datasets | Yes | We evaluate our method on six continuous bipedal locomotion and arm manipulation environments created by Lee et al. (2019): Repetitive picking up, Repetitive catching, Serve, Patrol, Hurdle, and Obstacle course.
Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits (percentages or counts) or reference predefined splits for the environments used.
Hardware Specification | No | The paper mentions that the environments are simulated with the MuJoCo physics engine but does not specify any hardware details such as CPU or GPU models or memory used for the experiments.
Software Dependencies | No | The paper mentions using PPO, the Adam optimizer, and the MuJoCo physics engine, but it does not provide specific version numbers for these components or for other libraries (e.g., PyTorch, TensorFlow) that were used.
Experiment Setup | Yes | We use the Adam optimizer (Kingma & Ba, 2015) with a mini-batch size of 64 and a learning rate of 1e-4 for the transition policies and a learning rate of 3e-4 for the discriminators. Our DQNs have a two-layer ReLU network with 128 units. The Adam optimizer is used with a learning rate of 1e-4 and a mini-batch size of 64 to update the DQNs. All of the DQN replay buffers store one million samples. We simply set r_s to 1 and r_f to -1 for all six environments.
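
To make the quoted setup easier to scan, the hyperparameters from the Experiment Setup row are collected below into a configuration sketch. Only the numeric values come from the paper; the variable and field names are illustrative and do not come from the released code.

    # Hyperparameters quoted in the Experiment Setup row (names are illustrative).
    TRANSITION_POLICY_CONFIG = dict(
        optimizer="Adam",
        batch_size=64,
        policy_lr=1e-4,         # learning rate for the transition policies
        discriminator_lr=3e-4,  # learning rate for the discriminators
    )

    DQN_CONFIG = dict(
        hidden_layers=(128, 128),  # "two-layer ReLU with 128 units" (per-layer width assumed)
        activation="ReLU",
        optimizer="Adam",
        lr=1e-4,
        batch_size=64,
        replay_buffer_size=1_000_000,  # one million samples
        reward_success=1.0,   # r_s
        reward_failure=-1.0,  # r_f
    )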
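
The Research Type row quotes the paper's central idea: the transition policy is trained through IRL so that the states it reaches match the state distribution of the pre-trained lower-level policies, rather than only chasing a task reward. The following is a minimal sketch of the kind of discriminator-based reward such distribution matching implies, assuming a GAIL-style adversarial setup in PyTorch; the network sizes, the log-sigmoid reward form, and all names here are assumptions for illustration, not the authors' implementation.

    import torch
    import torch.nn as nn

    class Discriminator(nn.Module):
        """Separates states from the pre-trained policy's distribution (label 1)
        from states reached by the transition policy (label 0)."""
        def __init__(self, state_dim, hidden=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )

        def forward(self, s):
            return self.net(s)  # logit: higher = "looks like a good start state"

    def discriminator_loss(disc, expert_states, transition_states):
        # Binary cross-entropy: states sampled from the pre-trained policy's
        # distribution are labeled 1, states visited by the transition policy 0.
        bce = nn.BCEWithLogitsLoss()
        expert_logits = disc(expert_states)
        policy_logits = disc(transition_states)
        return (bce(expert_logits, torch.ones_like(expert_logits)) +
                bce(policy_logits, torch.zeros_like(policy_logits)))

    def irl_reward(disc, states):
        # Reward the transition policy for reaching states the discriminator
        # attributes to the pre-trained policy's state distribution.
        with torch.no_grad():
            return torch.log(torch.sigmoid(disc(states)) + 1e-8)

In such a setup the transition policy would be optimized with PPO on irl_reward while the discriminator is updated with discriminator_loss, alternating the two until the transition policy's state distribution is indistinguishable from that of the pre-trained policy it hands off to.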