Learning Robust Policy against Disturbance in Transition Dynamics via State-Conservative Policy Optimization

Authors: Yufei Kuang, Miao Lu, Jie Wang, Qi Zhou, Bin Li, Houqiang Li

AAAI 2022, pp. 7247-7254 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments in several robot control tasks demonstrate that SCPO learns robust policies against the disturbance in transition dynamics.
Researcher Affiliation | Academia | (1) CAS Key Laboratory of Technology in GIPAS, University of Science and Technology of China; (2) Institute of Artificial Intelligence, Hefei Comprehensive National Science Center. Emails: {yfkuang, lumiao, zhouqida}@mail.ustc.edu.cn, {jiewangx, binli, lihq}@ustc.edu.cn
Pseudocode | Yes | We show the pseudo code of SC-SAC in Algorithm 3.
Open Source Code | No | No explicit statement or link regarding open-source code.
Open Datasets | Yes | In this section, we conduct experiments on the SCPO-based algorithm SC-SAC in several MuJoCo benchmarks (Todorov, Erez, and Tassa 2012) to evaluate its performance.
Dataset Splits | No | No explicit details on dataset splits (train/validation/test) with specific percentages or counts. The paper mentions "We train policies for 200k steps (i.e., 200 epochs) in InvertedDoublePendulum-v2 and 1000k steps (i.e., 1000 epochs) in other tasks."
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) are mentioned for the experimental setup.
Software Dependencies | No | The paper mentions using "SAC" and "MuJoCo benchmarks" but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | Hyperparameter Setting: The hyperparameter ϵ serves as a regularization coefficient in SCPO. By the definition in Section 4.2, larger ϵ implies higher intensity of the disturbance considered in SC-MDP. However, too large an ϵ can lead to suboptimal policies and thus degraded performance. Thus, we tune the hyperparameter ϵ in the Hopper-v2 task by grid search and find that it achieves the best performance when ϵ = 0.005. We then set ϵ = 0.005 for all the tasks in our experiments. See Section 6.2 for sensitivity analysis. Implementation and Evaluation Settings: We normalize the observations for both SAC and SC-SAC in all tasks. We keep all the parameters in SC-SAC the same as those in the original SAC. We train policies for 200k steps (i.e., 200 epochs) in InvertedDoublePendulum-v2 and 1000k steps (i.e., 1000 epochs) in other tasks. We train the policy for each task with 5 random seeds.
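The setup row above fixes the main training knobs: ϵ = 0.005 for all tasks (chosen by grid search on Hopper-v2), 200k steps for InvertedDoublePendulum-v2 and 1000k steps for the other tasks, normalized observations, 5 random seeds per task, and all remaining hyperparameters left at their SAC defaults. A minimal sketch of how such a run grid could be organized is given below; the function and dictionary names are hypothetical, since the paper does not release code, and only the numbers come from the reported settings.

```python
# Hypothetical sketch of the run configuration reported in the setup row above.
# The paper releases no code, so every identifier here is illustrative; only the
# numbers (epsilon = 0.005, step budgets, 5 seeds, normalized observations)
# mirror the reported settings.

TASK_STEPS = {
    "InvertedDoublePendulum-v2": 200_000,  # 200 epochs
    "Hopper-v2": 1_000_000,                # 1000 epochs; other tasks also use 1000k steps
}

def make_run_configs(epsilon=0.005, seeds=range(5)):
    """Yield one SC-SAC run config per (task, seed) pair, keeping SAC defaults otherwise."""
    for env_id, total_steps in TASK_STEPS.items():
        for seed in seeds:
            yield {
                "env_id": env_id,
                "total_steps": total_steps,
                "epsilon": epsilon,              # SCPO regularization coefficient
                "normalize_observations": True,  # applied to both SAC and SC-SAC
                "seed": seed,
                # all remaining hyperparameters identical to the original SAC
            }

if __name__ == "__main__":
    for cfg in make_run_configs():
        print(cfg)
```

Enumerating (task, seed) pairs this way keeps the five-seed protocol explicit and leaves ϵ as the single tunable deviation from the SAC baseline, which is what the reported setup describes.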