Truly Deterministic Policy Optimization
Authors: Ehsan Saleh, Saba Ghaffari, Tim Bretl, Matthew West
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we describe two novel robotic control environments, one with non-local rewards in the frequency domain and the other with a long horizon (8000 timesteps), for which our policy gradient method (TDPO) significantly outperforms existing methods (PPO, TRPO, DDPG, and TD3). |
| Researcher Affiliation | Academia | Ehsan Saleh¹, Saba Ghaffari¹, Timothy Bretl¹,², Matthew West³; ¹Department of Computer Science, ²Department of Aerospace Engineering, ³Department of Mechanical Science and Engineering, University of Illinois Urbana-Champaign. {ehsans2, sabag2, tbretl, mwest}@illinois.edu |
| Pseudocode | Yes | Algorithm 1 Truly Deterministic Policy Optimization (TDPO) and Algorithm 2 Deterministic Vine (De Vine) Policy Advantage Estimator |
| Open Source Code | Yes | Our implementation with all the experimental settings and a video of the physical hardware test is available at https://github.com/ehsansaleh/tdpo. |
| Open Datasets | No | The paper describes experiments conducted in simulated robotic control environments (simple pendulum, quadruped leg) using the MuJoCo software, implying data is generated through simulation rather than using a pre-existing public dataset. No specific public dataset with concrete access information is mentioned. |
| Dataset Splits | No | The paper conducts experiments in a reinforcement learning setting where data is generated through interaction with a simulated environment (trajectories). It does not explicitly define traditional train/validation/test dataset splits (e.g., 80/10/10 percentage) for a static dataset. Hyperparameter optimization is mentioned, which implicitly uses some form of validation, but no explicit data splits are provided. |
| Hardware Specification | No | The paper acknowledges the 'Blue Waters sustained-petascale computing project' but does not provide specific hardware details such as GPU models, CPU types, or memory amounts in the provided text. The checklist entry for compute resources states that this was left to the appendix, but the appendix is not available. |
| Software Dependencies | No | The paper mentions using 'MuJoCo software for simulation' and lists various HPO implementations (Optuna, Bayesian Optimization, Scikit-Optimize, GPyOpt, ProSRS) but does not provide specific version numbers for these software dependencies in the main text. The author checklist states that version details are in the code repository, but not directly in the paper. |
| Experiment Setup | Yes | The control loop rate is 4000 Hz and the rollout length is 2 s, resulting in a horizon of 8000 steps. A discount factor of γ = 0.99975 was chosen for all methods, where (1 − γ)⁻¹ is half the trajectory length. Similarly, the GAE factors for PPO and TRPO were scaled up to 0.99875 and 0.9995, respectively, in proportion to the trajectory length. We also systematically perform Hyper-Parameter Optimization (HPO) on all methods... All our experiments include 95% confidence intervals, and are run for 100 independent random seeds. (A worked example of this horizon-scaled discounting appears after the table.) |
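
As a quick check of the horizon-scaled discounting quoted above, the minimal Python sketch below derives γ from the 8000-step horizon so that (1 − γ)⁻¹ equals half the trajectory length, and scales a GAE factor in proportion to the trajectory length. The helper names, the 200-step baseline horizon, and the default GAE factors (0.95 for PPO, 0.98 for TRPO) are our assumptions for illustration, not values stated in the quoted text.

```python
# Sketch of the horizon-scaled discounting described in the experiment setup.
# The function names, the 200-step baseline horizon, and the default GAE
# factors are assumptions for illustration, not taken from the paper.

def discount_from_horizon(horizon_steps: int) -> float:
    """Pick gamma so that (1 - gamma)^-1 equals half the trajectory length."""
    return 1.0 - 1.0 / (horizon_steps / 2)

def scale_gae_lambda(lam_default: float, horizon_default: int, horizon_new: int) -> float:
    """Scale a GAE factor so its effective horizon (1 - lambda)^-1 grows
    in proportion to the trajectory length."""
    return 1.0 - (1.0 - lam_default) * horizon_default / horizon_new

H = 8000  # 4000 Hz control loop * 2 s rollout

print(discount_from_horizon(H))        # 0.99975, the reported discount factor
print(scale_gae_lambda(0.95, 200, H))  # 0.99875, the reported PPO GAE factor
print(scale_gae_lambda(0.98, 200, H))  # 0.9995,  the reported TRPO GAE factor
```

Under these assumed defaults, the scaling rule reproduces the γ and GAE values reported in the experiment setup.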