Learning Robust Policy against Disturbance in Transition Dynamics via State-Conservative Policy Optimization
Authors: Yufei Kuang, Miao Lu, Jie Wang, Qi Zhou, Bin Li, Houqiang Li
AAAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments in several robot control tasks demonstrate that SCPO learns robust policies against the disturbance in transition dynamics. |
| Researcher Affiliation | Academia | ¹CAS Key Laboratory of Technology in GIPAS, University of Science and Technology of China; ²Institute of Artificial Intelligence, Hefei Comprehensive National Science Center; {yfkuang, lumiao, zhouqida}@mail.ustc.edu.cn; {jiewangx, binli, lihq}@ustc.edu.cn |
| Pseudocode | Yes | We show the pseudocode of SC-SAC in Algorithm 3. |
| Open Source Code | No | No explicit statement or link regarding open-source code. |
| Open Datasets | Yes | In this section, we conduct experiments on the SCPO-based algorithm SC-SAC in several MuJoCo benchmarks (Todorov, Erez, and Tassa 2012) to evaluate its performance. |
| Dataset Splits | No | No explicit details on dataset splits (train/validation/test) with specific percentages or counts. The paper mentions "We train policies for 200k steps (i.e., 200 epochs) in InvertedDoublePendulum-v2 and 1000k steps (i.e., 1000 epochs) in other tasks." |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) are mentioned for the experimental setup. |
| Software Dependencies | No | The paper mentions using "SAC" and "MuJoCo benchmarks" but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | Hyperparameter Setting: The hyperparameter ϵ serves as a regularization coefficient in SCPO. By the definition in Section 4.2, a larger ϵ implies a higher intensity of the disturbance considered in the SC-MDP; however, an overly large ϵ can lead to suboptimal policies and thus degraded performance. We therefore tune ϵ in the Hopper-v2 task by grid search, find that performance peaks at ϵ = 0.005, and set ϵ = 0.005 for all tasks in our experiments (see Section 6.2 for a sensitivity analysis, and the first sketch after this table). Implementation and Evaluation Settings: We normalize the observations for both SAC and SC-SAC in all tasks and keep all parameters in SC-SAC the same as in the original SAC. We train policies for 200k steps (i.e., 200 epochs) in InvertedDoublePendulum-v2 and 1000k steps (i.e., 1000 epochs) in the other tasks, with 5 random seeds per task (see the second sketch after this table). |
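
The paper's Algorithm 3 is not reproduced in this report. Purely as an illustrative assumption of how a radius ϵ can act as a regularization coefficient in a SAC-style update, the sketch below approximates a worst-case state in an ℓ∞ ball of radius ϵ with a single gradient-sign step. `q_net` and `policy` are hypothetical PyTorch modules, and this is not the authors' implementation (no open-source code was found).

```python
# Illustrative sketch only: NOT the paper's Algorithm 3. It shows one way a
# perturbation radius eps could regularize a SAC-style objective by
# evaluating the critic at an approximately worst-case nearby state.
# `q_net(s, a)` and `policy(s)` stand in for hypothetical torch.nn.Modules.
import torch

def state_conservative_q(q_net, policy, state, eps=0.005):
    """Approximate min over ||d||_inf <= eps of Q(s + d, pi(s + d)) with one
    fast-gradient-sign step (an assumption, not the paper's exact scheme)."""
    state = state.clone().detach().requires_grad_(True)
    q_value = q_net(state, policy(state)).sum()
    grad, = torch.autograd.grad(q_value, state)
    # Step against the gradient of Q w.r.t. the state to lower the value,
    # yielding an approximately worst-case state inside the eps-ball.
    worst_state = (state - eps * grad.sign()).detach()
    return q_net(worst_state, policy(worst_state))
```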
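For the training protocol quoted above, a minimal sketch of the implied loop follows: normalized observations, ϵ = 0.005 everywhere, 200k steps for InvertedDoublePendulum-v2 and 1000k steps elsewhere, 5 random seeds per task. `train_sc_sac` is a hypothetical entry point standing in for the authors' unreleased code, and only the two task/step pairings named in the paper are listed.

```python
# Sketch of the reported protocol under the assumptions stated above.
def train_sc_sac(task, total_steps, eps, normalize_obs, seed):
    """Hypothetical training entry point standing in for the authors'
    unreleased SC-SAC code; a real run would build the MuJoCo env here."""
    ...

TASK_STEPS = {
    "InvertedDoublePendulum-v2": 200_000,  # 200 epochs
    "Hopper-v2": 1_000_000,                # 1000 epochs; other tasks likewise
}

for task, total_steps in TASK_STEPS.items():
    for seed in range(5):  # 5 random seeds per task, as reported
        train_sc_sac(task, total_steps=total_steps, eps=0.005,
                     normalize_obs=True, seed=seed)
```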