Truly Deterministic Policy Optimization
Authors: Ehsan Saleh, Saba Ghaffari, Tim Bretl, Matthew West
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we describe two novel robotic control environments, one with non-local rewards in the frequency domain and the other with a long horizon (8000 timesteps), for which our policy gradient method (TDPO) significantly outperforms existing methods (PPO, TRPO, DDPG, and TD3). |
| Researcher Affiliation | Academia | Ehsan Saleh¹, Saba Ghaffari¹, Timothy Bretl¹,², Matthew West³; ¹Department of Computer Science, ²Department of Aerospace Engineering, ³Department of Mechanical Science and Engineering, University of Illinois Urbana-Champaign. {ehsans2, sabag2, tbretl, mwest}@illinois.edu |
| Pseudocode | Yes | Algorithm 1 Truly Deterministic Policy Optimization (TDPO) and Algorithm 2 Deterministic Vine (De Vine) Policy Advantage Estimator |
| Open Source Code | Yes | Our implementation with all the experimental settings and a video of the physical hardware test is available at https://github.com/ehsansaleh/tdpo. |
| Open Datasets | No | The paper describes experiments conducted in simulated robotic control environments (simple pendulum, quadruped leg) using the MuJoCo software, implying data is generated through simulation rather than using a pre-existing public dataset. No specific public dataset with concrete access information is mentioned. |
| Dataset Splits | No | The paper conducts experiments in a reinforcement learning setting where data is generated through interaction with a simulated environment (trajectories). It does not explicitly define traditional train/validation/test dataset splits (e.g., 80/10/10 percentage) for a static dataset. Hyperparameter optimization is mentioned, which implicitly uses some form of validation, but no explicit data splits are provided. |
| Hardware Specification | No | The paper acknowledges the 'Blue Waters sustained-petascale computing project' but does not provide specific hardware details such as GPU models, CPU types, or memory amounts in the provided text. The checklist entry for compute resources states that this was left to the appendix, but the appendix is not available. |
| Software Dependencies | No | The paper mentions using 'MuJoCo software for simulation' and lists various HPO implementations (Optuna, Bayesian Optimization, Scikit-Optimize, GPyOpt, ProSRS) but does not provide specific version numbers for these software dependencies in the main text. The author checklist states that version details are in the code repository, but not directly in the paper. |
| Experiment Setup | Yes | The control loop rate is 4000 Hz and the rollout length is 2 s, resulting in a horizon of 8000 steps. A discount factor of γ = 0.99975 was chosen for all methods, where (1 − γ)⁻¹ is half the trajectory length. Similarly, the GAE factors for PPO and TRPO were scaled up to 0.99875 and 0.9995, respectively, in proportion to the trajectory length. We also systematically perform Hyper-Parameter Optimization (HPO) on all methods... All our experiments include 95% confidence intervals, and are run for 100 independent random seeds. (A worked example of this horizon-scaled discounting appears after the table.) |
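
As a quick check of the horizon-scaled discounting quoted above, the minimal Python sketch below derives γ from the 8000-step horizon so that (1 − γ)⁻¹ equals half the trajectory length, and scales a GAE factor in proportion to the trajectory length. The helper names, the 200-step baseline horizon, and the default GAE factors (0.95 for PPO, 0.98 for TRPO) are our assumptions for illustration, not values stated in the quoted text.

```python
# Sketch of the horizon-scaled discounting described in the experiment setup.
# The function names, the 200-step baseline horizon, and the default GAE
# factors are assumptions for illustration, not taken from the paper.

def discount_from_horizon(horizon_steps: int) -> float:
    """Pick gamma so that (1 - gamma)^-1 equals half the trajectory length."""
    return 1.0 - 1.0 / (horizon_steps / 2)

def scale_gae_lambda(lam_default: float, horizon_default: int, horizon_new: int) -> float:
    """Scale a GAE factor so its effective horizon (1 - lambda)^-1 grows
    in proportion to the trajectory length."""
    return 1.0 - (1.0 - lam_default) * horizon_default / horizon_new

H = 8000  # 4000 Hz control loop * 2 s rollout

print(discount_from_horizon(H))        # 0.99975, the reported discount factor
print(scale_gae_lambda(0.95, 200, H))  # 0.99875, the reported PPO GAE factor
print(scale_gae_lambda(0.98, 200, H))  # 0.9995,  the reported TRPO GAE factor
```

Under these assumed defaults, the scaling rule reproduces the γ and GAE values reported in the experiment setup.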