Iterative Regularized Policy Optimization with Imperfect Demonstrations

Authors: Xudong Gong, Dawei Feng, Kele Xu, Yuanzhao Zhai, Chengkang Yao, Weijia Wang, Bo Ding, Huaimin Wang

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental validations conducted across widely used benchmarks and a novel fixed-wing UAV control task consistently demonstrate the effectiveness of IRPO in improving both the demonstration quality and the policy performance.
Researcher Affiliation | Collaboration | 1. College of Computer, National University of Defense Technology, Changsha, Hunan, China; 2. State Key Laboratory of Complex & Critical Software Environment, Changsha, Hunan, China; 3. Flight Automatic Control Research Institute, AVIC, Xi'an, Shaanxi, China.
Pseudocode | Yes | Algorithm 1: Iterative Regularized Policy Optimization (IRPO) method. (A hedged structural sketch of such an iterative loop follows the table.)
Open Source Code | Yes | Our code is available at https://github.com/GongXudong/IRPO.
Open Datasets | Yes | Articulated-body control: HalfCheetah and Hopper tasks on the MuJoCo physics engine with D4RL (Fu et al., 2020) datasets. Robotic arm control: modified Reach task (Gallouédec et al., 2021) on the Bullet physics engine with demonstrations generated by a PID controller. Fixed-wing UAV attitude control: attitude control task in a self-designed fixed-wing UAV environment with demonstrations generated by a PID controller and human play data (Wang et al., 2023a). (A hedged D4RL loading sketch follows the table.)
Dataset Splits | No | The paper describes how demonstrations are generated or obtained, and their characteristics, but does not explicitly provide training/validation/test dataset splits (e.g., percentages or absolute counts) needed to reproduce the experiments.
Hardware Specification | No | The paper does not provide specific details about the hardware used for experiments, such as GPU/CPU models, memory, or specific cloud instance types.
Software Dependencies | No | The paper mentions the Imitation framework (Gleave et al., 2022) and the Stable Baselines3 framework (Raffin et al., 2021) but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | A fully connected network of size 256 × 256 is employed for the HalfCheetah, Hopper, and Reach tasks. For attitude control, a 128 × 128 fully connected network is used for the first and second training iterations, and a larger 256 × 256 × 128 × 128 × 64 architecture for the third and fourth iterations. The Tanh activation function is applied throughout all training processes. Table 7: parameters used in BC; Table 8: parameters used in PPO. (A hedged configuration sketch follows the table.)
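
The paper's Algorithm 1 is not reproduced in this report. Purely as a reading aid, the following is a minimal, hypothetical Python sketch of an iterative loop that pretrains on demonstrations, improves the policy under a penalty that keeps it near the demonstration-derived behavior, and then regenerates the demonstration set, which is the general pattern the quoted summary describes. The toy task, the stand-in functions (pretrain_from_demos, optimize, collect_demos), the regularization weight, and the iteration count are all assumptions for illustration and are not the authors' implementation.

# Hypothetical structural sketch of an iterative demonstration-refinement loop
# on a toy 1-D task. NOT the paper's Algorithm 1; every detail is illustrative.
import numpy as np

rng = np.random.default_rng(0)
TARGET = 1.0  # optimal action of the toy task


def reward(actions: np.ndarray) -> np.ndarray:
    """Toy reward: larger when actions are closer to TARGET."""
    return -((actions - TARGET) ** 2)


def pretrain_from_demos(demo_actions: np.ndarray) -> float:
    """Behavior-cloning stand-in: fit the mean of the demonstrated actions."""
    return float(np.mean(demo_actions))


def optimize(policy_mean: float, reference: float, reg_weight: float = 0.1) -> float:
    """Regularized-improvement stand-in: hill-climb the reward while penalizing
    distance from the reference (demonstration-derived) policy."""
    candidates = policy_mean + rng.normal(0.0, 0.2, size=256)
    scores = reward(candidates) - reg_weight * (candidates - reference) ** 2
    return float(candidates[np.argmax(scores)])


def collect_demos(policy_mean: float, n: int = 64) -> np.ndarray:
    """Roll out the current policy (with noise) to refresh the demonstration set."""
    return policy_mean + rng.normal(0.0, 0.05, size=n)


# Imperfect initial demonstrations, biased away from the optimum.
demos = rng.normal(0.3, 0.1, size=64)
policy = 0.0
for iteration in range(4):
    policy = pretrain_from_demos(demos)           # 1) imitate the current demos
    policy = optimize(policy, reference=policy)   # 2) regularized improvement
    demos = collect_demos(policy)                 # 3) regenerate better demos
    print(f"iter {iteration}: policy mean = {policy:.3f}, "
          f"mean demo reward = {reward(demos).mean():.3f}")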
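
For the D4RL datasets cited in the Open Datasets row, a minimal loading sketch is shown below. The specific dataset IDs used by the authors are not stated in this report, so "halfcheetah-medium-v2" and "hopper-medium-v2" are assumptions; the env.get_dataset() call is D4RL's standard way of exposing the offline data.

# Hypothetical D4RL loading sketch; dataset IDs are illustrative assumptions.
import gym   # D4RL targets the classic gym API
import d4rl  # registers the D4RL dataset environments on import

for env_id in ("halfcheetah-medium-v2", "hopper-medium-v2"):
    env = gym.make(env_id)
    dataset = env.get_dataset()  # dict of numpy arrays
    print(
        env_id,
        dataset["observations"].shape,
        dataset["actions"].shape,
        dataset["rewards"].shape,
    )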
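
The network shapes in the Experiment Setup row map naturally onto Stable-Baselines3 policy_kwargs. The sketch below shows one way they could be expressed; the environment, training budget, and library versions are assumptions, and the actual hyperparameters live in the paper's Tables 7 and 8.

# Hypothetical configuration sketch using Stable-Baselines3; the environment
# and training budget are placeholders, not the paper's setup.
import torch as th
from stable_baselines3 import PPO

# Two hidden layers of 256 units with Tanh activations, as reported for
# HalfCheetah, Hopper, and Reach.
small = PPO(
    "MlpPolicy",
    "Pendulum-v1",  # placeholder environment
    policy_kwargs=dict(activation_fn=th.nn.Tanh, net_arch=[256, 256]),
    verbose=0,
)

# Larger architecture reported for the third and fourth attitude-control iterations.
large = PPO(
    "MlpPolicy",
    "Pendulum-v1",  # placeholder environment
    policy_kwargs=dict(activation_fn=th.nn.Tanh, net_arch=[256, 256, 128, 128, 64]),
    verbose=0,
)

small.learn(total_timesteps=1_000)  # illustrative budget only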