Iterative Regularized Policy Optimization with Imperfect Demonstrations

Authors: Xudong Gong, Dawei Feng, Kele Xu, Yuanzhao Zhai, Chengkang Yao, Weijia Wang, Bo Ding, Huaimin Wang

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental validations conducted across widely used benchmarks and a novel fixed-wing UAV control task consistently demonstrate the effectiveness of IRPO in improving both the demonstration quality and the policy performance.
Researcher Affiliation | Collaboration | 1. College of Computer, National University of Defense Technology, Changsha, Hunan, China; 2. State Key Laboratory of Complex & Critical Software Environment, Changsha, Hunan, China; 3. Flight Automatic Control Research Institute, AVIC, Xi'an, Shaanxi, China.
Pseudocode | Yes | Algorithm 1: Iterative Regularized Policy Optimization (IRPO) method. (A hedged structural sketch of such an iterative loop follows the table.)
Open Source Code | Yes | Our code is available at https://github.com/GongXudong/IRPO.
Open Datasets | Yes | Articulated-body control: HalfCheetah and Hopper tasks on the MuJoCo physics engine with D4RL (Fu et al., 2020) datasets. Robotic arm control: modified Reach task (Gallouédec et al., 2021) on the Bullet physics engine with demonstrations generated by a PID controller. Fixed-wing UAV attitude control: attitude control task in a self-designed fixed-wing UAV environment with demonstrations generated by a PID controller and human play data (Wang et al., 2023a). (A hedged D4RL loading sketch follows the table.)
Dataset Splits | No | The paper describes how demonstrations are generated or obtained, and their characteristics, but does not explicitly provide training/validation/test dataset splits (e.g., percentages or absolute counts) needed to reproduce the experiments.
Hardware Specification | No | The paper does not provide specific details about the hardware used for experiments, such as GPU/CPU models, memory, or specific cloud instance types.
Software Dependencies | No | The paper mentions the Imitation framework (Gleave et al., 2022) and the Stable Baselines3 framework (Raffin et al., 2021) but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | A fully connected network of size 256 × 256 is employed for the HalfCheetah, Hopper, and Reach tasks. For attitude control, a 128 × 128 fully connected network is used for the first and second training iterations, and a larger 256 × 256 × 128 × 128 × 64 architecture for the third and fourth iterations. The Tanh activation function is applied throughout all training processes. Table 7: parameters used in BC; Table 8: parameters used in PPO. (A hedged configuration sketch follows the table.)
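
The paper's Algorithm 1 is not reproduced in this report. Purely as a reading aid, the following is a minimal, hypothetical Python sketch of an iterative loop that pretrains on demonstrations, improves the policy under a penalty that keeps it near the demonstration-derived behavior, and then regenerates the demonstration set, which is the general pattern the quoted summary describes. The toy task, the stand-in functions (pretrain_from_demos, optimize, collect_demos), the regularization weight, and the iteration count are all assumptions for illustration and are not the authors' implementation.

# Hypothetical structural sketch of an iterative demonstration-refinement loop
# on a toy 1-D task. NOT the paper's Algorithm 1; every detail is illustrative.
import numpy as np

rng = np.random.default_rng(0)
TARGET = 1.0  # optimal action of the toy task


def reward(actions: np.ndarray) -> np.ndarray:
    """Toy reward: larger when actions are closer to TARGET."""
    return -((actions - TARGET) ** 2)


def pretrain_from_demos(demo_actions: np.ndarray) -> float:
    """Behavior-cloning stand-in: fit the mean of the demonstrated actions."""
    return float(np.mean(demo_actions))


def optimize(policy_mean: float, reference: float, reg_weight: float = 0.1) -> float:
    """Regularized-improvement stand-in: hill-climb the reward while penalizing
    distance from the reference (demonstration-derived) policy."""
    candidates = policy_mean + rng.normal(0.0, 0.2, size=256)
    scores = reward(candidates) - reg_weight * (candidates - reference) ** 2
    return float(candidates[np.argmax(scores)])


def collect_demos(policy_mean: float, n: int = 64) -> np.ndarray:
    """Roll out the current policy (with noise) to refresh the demonstration set."""
    return policy_mean + rng.normal(0.0, 0.05, size=n)


# Imperfect initial demonstrations, biased away from the optimum.
demos = rng.normal(0.3, 0.1, size=64)
policy = 0.0
for iteration in range(4):
    policy = pretrain_from_demos(demos)           # 1) imitate the current demos
    policy = optimize(policy, reference=policy)   # 2) regularized improvement
    demos = collect_demos(policy)                 # 3) regenerate better demos
    print(f"iter {iteration}: policy mean = {policy:.3f}, "
          f"mean demo reward = {reward(demos).mean():.3f}")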
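
For the D4RL datasets cited in the Open Datasets row, a minimal loading sketch is shown below. The specific dataset IDs used by the authors are not stated in this report, so "halfcheetah-medium-v2" and "hopper-medium-v2" are assumptions; the env.get_dataset() call is D4RL's standard way of exposing the offline data.

# Hypothetical D4RL loading sketch; dataset IDs are illustrative assumptions.
import gym   # D4RL targets the classic gym API
import d4rl  # registers the D4RL dataset environments on import

for env_id in ("halfcheetah-medium-v2", "hopper-medium-v2"):
    env = gym.make(env_id)
    dataset = env.get_dataset()  # dict of numpy arrays
    print(
        env_id,
        dataset["observations"].shape,
        dataset["actions"].shape,
        dataset["rewards"].shape,
    )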
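
The network shapes in the Experiment Setup row map naturally onto Stable-Baselines3 policy_kwargs. The sketch below shows one way they could be expressed; the environment, training budget, and library versions are assumptions, and the actual hyperparameters live in the paper's Tables 7 and 8.

# Hypothetical configuration sketch using Stable-Baselines3; the environment
# and training budget are placeholders, not the paper's setup.
import torch as th
from stable_baselines3 import PPO

# Two hidden layers of 256 units with Tanh activations, as reported for
# HalfCheetah, Hopper, and Reach.
small = PPO(
    "MlpPolicy",
    "Pendulum-v1",  # placeholder environment
    policy_kwargs=dict(activation_fn=th.nn.Tanh, net_arch=[256, 256]),
    verbose=0,
)

# Larger architecture reported for the third and fourth attitude-control iterations.
large = PPO(
    "MlpPolicy",
    "Pendulum-v1",  # placeholder environment
    policy_kwargs=dict(activation_fn=th.nn.Tanh, net_arch=[256, 256, 128, 128, 64]),
    verbose=0,
)

small.learn(total_timesteps=1_000)  # illustrative budget only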