PODS: Policy Optimization via Differentiable Simulation
Authors: Miguel Angel Zamora Mora, Momchil Peychev, Sehoon Ha, Martin Vechev, Stelian Coros
ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate the policy optimization scheme that we propose, we apply it to a set of control problems that require payloads to be manipulated via stiff or elastic cables. We have chosen to focus our attention on this class of high-precision dynamic manipulation tasks for the following reasons: The results of our experiments confirm our theoretical derivations and show that our method consistently outperforms two state-of-the-art (SOTA) model-free RL algorithms, Proximal Policy Optimization (PPO) (Wang et al., 2019) and Soft Actor-Critic (SAC) (Haarnoja et al., 2018), as well as the model-based approach of Backpropagation Through Time (BPTT). Although our policy optimization scheme (PODS) can be interleaved within the algorithmic framework of most RL methods (e.g. by periodically updating the means of the probability distributions represented by stochastic policies), we focused our efforts on evaluating it in isolation to pinpoint the benefits it brings. This allowed us to show that with minimal hyper-parameter tuning, the second order update rule that we derive provides an excellent balance between rapid, reliable convergence and computational complexity. In conjunction with the continued evolution of accurate differentiable simulators, our method promises to significantly improve the process of learning control policies using RL. |
| Researcher Affiliation | Academia | ¹Department of Computer Science, ETH Zurich, Zurich, Switzerland; ²School of Interactive Computing, Georgia Institute of Technology, Georgia, USA. |
| Pseudocode | Yes | Algorithm 1 PODS: Policy Optimization via Differentiable Simulators |
| Open Source Code | No | The paper does not provide an explicit statement about releasing code or a link to a code repository for the described methodology. |
| Open Datasets | No | The paper describes the creation of custom environments and tasks, but it does not specify public access to a dataset or provide concrete access information. |
| Dataset Splits | No | The paper mentions collecting 'k = 4000 rollouts for each epoch' and evaluating rewards from '1000 rollouts started from a test bed of unseen initial states', but it does not provide specific percentages or counts for training, validation, and test splits needed for reproducibility, nor does it cite a predefined standard split for its custom environments. |
| Hardware Specification | Yes | All experiments were run using a desktop PC with an Intel Core i7-8700K CPU and a GeForce GTX 1080 Ti graphics card. |
| Software Dependencies | No | The paper mentions using a 'simulation engine that follows closely the description in Zimmermann et al. (2019)' and a 'BDF2 integration scheme', as well as 'PPO (Wang et al., 2019) and Soft Actor-Critic (SAC) (Haarnoja et al., 2018)' for comparison, and 'standard implementations provided in Achiam (2018)'. However, it does not provide specific version numbers for any of the software dependencies, such as the simulation engine, specific libraries, or frameworks like PyTorch or TensorFlow. |
| Experiment Setup | Yes | For the experiments we present in the next section, we collected k = 4000 rollouts for each epoch, and we performed 50 gradient descent steps on Lθ for each policy optimization iteration. For all the environments, the action space describes instantaneous velocities of the handles, which are restricted to remain within physically reasonable limits. We fine tuned hyper parameters to get the best performance we could, and otherwise ran standard implementations provided in Achiam (2018). |
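
The Experiment Setup and Pseudocode rows above describe a loop that collects k = 4000 rollouts per epoch and takes 50 gradient-descent steps on the policy loss Lθ per policy-optimization iteration. The sketch below is only a generic illustration of policy optimization through a differentiable simulator at that scale; it is not the authors' PODS update rule (the paper derives a dedicated second-order update), and the `Policy`, `sim_step`, `reward_fn`, `sample_initial_state`, and `horizon` components are hypothetical placeholders.

```python
# Illustrative sketch only: a generic differentiable-simulation training loop
# at the scale quoted in the Experiment Setup row. Not the authors' code.
import torch
import torch.nn as nn


class Policy(nn.Module):
    """Placeholder deterministic policy mapping states to handle velocities."""

    def __init__(self, state_dim: int, action_dim: int, action_limit: float):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.Tanh(),
            nn.Linear(64, action_dim), nn.Tanh(),
        )
        self.action_limit = action_limit  # keep actions within physical limits

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.action_limit * self.net(state)


def rollout(policy, sim_step, s0, horizon):
    """Unroll a differentiable simulator for one trajectory (sim_step is a placeholder)."""
    states, actions = [s0], []
    s = s0
    for _ in range(horizon):
        a = policy(s)
        s = sim_step(s, a)  # differentiable simulation step
        actions.append(a)
        states.append(s)
    return states, actions


def train(policy, sim_step, sample_initial_state, reward_fn,
          epochs=10, k=4000, horizon=50, opt_steps=50, lr=1e-3):
    """Collect k rollouts per epoch, then take opt_steps gradient steps on L_theta."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        # k = 4000 rollouts per epoch in the paper; horizon here is illustrative.
        initial_states = [sample_initial_state() for _ in range(k)]
        for _ in range(opt_steps):  # 50 gradient steps per policy-optimization iteration
            loss = torch.zeros(())
            for s0 in initial_states:
                states, actions = rollout(policy, sim_step, s0, horizon)
                loss = loss - reward_fn(states, actions)  # L_theta as negative total reward
            opt.zero_grad()
            (loss / k).backward()  # gradients flow back through the simulator steps
            opt.step()
    return policy
```

As written, this first-order loop is closer in spirit to the BPTT baseline the paper compares against than to PODS itself, which the quoted text distinguishes by its second-order update rule; the sketch is included only to make the quoted rollout and gradient-step counts concrete.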