Learning Robust Policy against Disturbance in Transition Dynamics via State-Conservative Policy Optimization

Authors: Yufei Kuang, Miao Lu, Jie Wang, Qi Zhou, Bin Li, Houqiang Li

AAAI 2022, pp. 7247-7254 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments in several robot control tasks demonstrate that SCPO learns robust policies against the disturbance in transition dynamics.
Researcher Affiliation | Academia | (1) CAS Key Laboratory of Technology in GIPAS, University of Science and Technology of China; (2) Institute of Artificial Intelligence, Hefei Comprehensive National Science Center. Emails: {yfkuang, lumiao, zhouqida}@mail.ustc.edu.cn, {jiewangx, binli, lihq}@ustc.edu.cn
Pseudocode | Yes | We show the pseudo code of SC-SAC in Algorithm 3.
Open Source Code | No | No explicit statement or link regarding open-source code.
Open Datasets | Yes | In this section, we conduct experiments on the SCPO-based algorithm SC-SAC in several MuJoCo benchmarks (Todorov, Erez, and Tassa 2012) to evaluate its performance.
Dataset Splits | No | No explicit details on dataset splits (train/validation/test) with specific percentages or counts. The paper mentions "We train policies for 200k steps (i.e., 200 epochs) in InvertedDoublePendulum-v2 and 1000k steps (i.e., 1000 epochs) in other tasks."
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) are mentioned for the experimental setup.
Software Dependencies | No | The paper mentions using "SAC" and "MuJoCo benchmarks" but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | Hyperparameter Setting: The hyperparameter ϵ serves as a regularization coefficient in SCPO. By the definition in Section 4.2, larger ϵ implies higher intensity of the disturbance considered in SC-MDP. However, too large an ϵ can lead to suboptimal policies and thus degraded performance. Thus, we tune the hyperparameter ϵ in the Hopper-v2 task by grid search and find that it achieves the best performance when ϵ = 0.005. We then set ϵ = 0.005 for all the tasks in our experiments. See Section 6.2 for sensitivity analysis. Implementation and Evaluation Settings: We normalize the observations for both SAC and SC-SAC in all tasks. We keep all the parameters in SC-SAC the same as those in the original SAC. We train policies for 200k steps (i.e., 200 epochs) in InvertedDoublePendulum-v2 and 1000k steps (i.e., 1000 epochs) in other tasks. We train the policy for each task with 5 random seeds.
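The setup row above fixes the main training knobs: ϵ = 0.005 for all tasks (chosen by grid search on Hopper-v2), 200k steps for InvertedDoublePendulum-v2 and 1000k steps for the other tasks, normalized observations, 5 random seeds per task, and all remaining hyperparameters left at their SAC defaults. A minimal sketch of how such a run grid could be organized is given below; the function and dictionary names are hypothetical, since the paper does not release code, and only the numbers come from the reported settings.

```python
# Hypothetical sketch of the run configuration reported in the setup row above.
# The paper releases no code, so every identifier here is illustrative; only the
# numbers (epsilon = 0.005, step budgets, 5 seeds, normalized observations)
# mirror the reported settings.

TASK_STEPS = {
    "InvertedDoublePendulum-v2": 200_000,  # 200 epochs
    "Hopper-v2": 1_000_000,                # 1000 epochs; other tasks also use 1000k steps
}

def make_run_configs(epsilon=0.005, seeds=range(5)):
    """Yield one SC-SAC run config per (task, seed) pair, keeping SAC defaults otherwise."""
    for env_id, total_steps in TASK_STEPS.items():
        for seed in seeds:
            yield {
                "env_id": env_id,
                "total_steps": total_steps,
                "epsilon": epsilon,              # SCPO regularization coefficient
                "normalize_observations": True,  # applied to both SAC and SC-SAC
                "seed": seed,
                # all remaining hyperparameters identical to the original SAC
            }

if __name__ == "__main__":
    for cfg in make_run_configs():
        print(cfg)
```

Enumerating (task, seed) pairs this way keeps the five-seed protocol explicit and leaves ϵ as the single tunable deviation from the SAC baseline, which is what the reported setup describes.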