Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Learning Robust Policy against Disturbance in Transition Dynamics via State-Conservative Policy Optimization
Authors: Yufei Kuang, Miao Lu, Jie Wang, Qi Zhou, Bin Li, Houqiang Li (pp. 7247-7254)
AAAI 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments in several robot control tasks demonstrate that SCPO learns robust policies against the disturbance in transition dynamics. |
| Researcher Affiliation | Academia | ¹CAS Key Laboratory of Technology in GIPAS, University of Science and Technology of China; ²Institute of Artificial Intelligence, Hefei Comprehensive National Science Center |
| Pseudocode | Yes | We show the pseudo code of SC-SAC in Algorithm 3. |
| Open Source Code | No | No explicit statement or link regarding open-source code. |
| Open Datasets | Yes | In this section, we conduct experiments on the SCPO-based algorithm SC-SAC in several MuJoCo benchmarks (Todorov, Erez, and Tassa 2012) to evaluate its performance. |
| Dataset Splits | No | No explicit details on dataset splits (train/validation/test) with specific percentages or counts. The paper mentions "We train policies for 200k steps (i.e., 200 epochs) in InvertedDoublePendulum-v2 and 1000k steps (i.e., 1000 epochs) in other tasks." |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) are mentioned for the experimental setup. |
| Software Dependencies | No | The paper mentions using "SAC" and "MuJoCo benchmarks" but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | Hyperparameter Setting: The hyperparameter ϵ serves as a regularization coefficient in SCPO. By the definition in Section 4.2, larger ϵ implies higher intensity of the disturbance considered in SC-MDP. However, too large ϵ can lead to suboptimal policies and thus degraded performance. Thus, we tune the hyperparameter ϵ in the Hopper-v2 task by grid search and find that it achieves the best performance when ϵ = 0.005. We then set ϵ = 0.005 for all the tasks in our experiments. See Section 6.2 for sensitivity analysis. Implementation and Evaluation Settings: We normalize the observations for both SAC and SC-SAC in all tasks. We keep all the parameters in SC-SAC the same as those in original SAC. We train policies for 200k steps (i.e., 200 epochs) in InvertedDoublePendulum-v2 and 1000k steps (i.e., 1000 epochs) in other tasks. We train policy for each task with 5 random seeds. |
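
The experiment setup quoted in the table can be written down as a small run configuration. The following is a minimal sketch, not the authors' code: the names `RunConfig`, `total_steps_for`, and `build_runs` are hypothetical, the seed values 0 to 4 are an assumption (the paper reports only that 5 random seeds were used), and the 1,000 steps per epoch figure is inferred from "200k steps (i.e., 200 epochs)". The task string uses the Gym environment ID for the inverted double pendulum task.

```python
from dataclasses import dataclass

# Inferred from the quoted setup: 200k steps correspond to 200 epochs.
STEPS_PER_EPOCH = 1_000
# Regularization coefficient reported for all tasks (tuned on Hopper-v2).
EPSILON = 0.005
# The paper reports 5 random seeds; the concrete seed values are assumed.
NUM_SEEDS = 5


@dataclass(frozen=True)
class RunConfig:
    """One training run; a hypothetical container, not from the paper."""
    task: str
    seed: int
    epsilon: float
    total_steps: int
    normalize_observations: bool = True  # applied to both SAC and SC-SAC

    @property
    def epochs(self) -> int:
        return self.total_steps // STEPS_PER_EPOCH


def total_steps_for(task: str) -> int:
    """200k steps for the inverted double pendulum, 1000k for other tasks."""
    return 200_000 if task == "InvertedDoublePendulum-v2" else 1_000_000


def build_runs(tasks):
    """Enumerate one run per (task, seed) pair, grouped by task."""
    return [
        RunConfig(task=t, seed=s, epsilon=EPSILON, total_steps=total_steps_for(t))
        for t in tasks
        for s in range(NUM_SEEDS)  # assumed seed values 0..4
    ]
```

Calling `build_runs(["Hopper-v2", "InvertedDoublePendulum-v2"])` yields ten configurations, five per task, which makes the per-task training budget and the shared ϵ explicit before any training code is written.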