PODS: Policy Optimization via Differentiable Simulation
Authors: Miguel Angel Zamora Mora, Momchil Peychev, Sehoon Ha, Martin Vechev, Stelian Coros
ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate the policy optimization scheme that we propose, we apply it to a set of control problems that require payloads to be manipulated via stiff or elastic cables. We have chosen to focus our attention on this class of high-precision dynamic manipulation tasks for the following reasons: The results of our experiments confirm our theoretical derivations and show that our method consistently outperforms two state-of-the-art (SOTA) model-free RL algorithms, Proximal Policy Optimization (PPO) (Wang et al., 2019) and Soft Actor-Critic (SAC) (Haarnoja et al., 2018), as well as the model-based approach of Backpropagation Through Time (BPTT). Although our policy optimization scheme (PODS) can be interleaved within the algorithmic framework of most RL methods (e.g. by periodically updating the means of the probability distributions represented by stochastic policies), we focused our efforts on evaluating it in isolation to pinpoint the benefits it brings. This allowed us to show that with minimal hyper-parameter tuning, the second order update rule that we derive provides an excellent balance between rapid, reliable convergence and computational complexity. In conjunction with the continued evolution of accurate differentiable simulators, our method promises to significantly improve the process of learning control policies using RL. |
| Researcher Affiliation | Academia | ¹Department of Computer Science, ETH Zurich, Zurich, Switzerland; ²School of Interactive Computing, Georgia Institute of Technology, Georgia, USA. |
| Pseudocode | Yes | Algorithm 1 PODS: Policy Optimization via Differentiable Simulators |
| Open Source Code | No | The paper does not provide an explicit statement about releasing code or a link to a code repository for the described methodology. |
| Open Datasets | No | The paper describes the creation of custom environments and tasks, but it does not specify public access to a dataset or provide concrete access information. |
| Dataset Splits | No | The paper mentions collecting 'k = 4000 rollouts for each epoch' and evaluating rewards from '1000 rollouts started from a test bed of unseen initial states', but it does not provide specific percentages or counts for training, validation, and test splits needed for reproducibility, nor does it cite a predefined standard split for its custom environments. |
| Hardware Specification | Yes | All experiments were run using a desktop PC with an Intel Core i7-8700K CPU and a GeForce GTX 1080 Ti graphics card. |
| Software Dependencies | No | The paper mentions using a 'simulation engine that follows closely the description in Zimmermann et al. (2019)' and a 'BDF2 integration scheme', as well as 'PPO (Wang et al., 2019) and Soft Actor-Critic (SAC) (Haarnoja et al., 2018)' for comparison, and 'standard implementations provided in Achiam (2018)'. However, it does not provide specific version numbers for any of the software dependencies, such as the simulation engine, specific libraries, or frameworks like PyTorch or TensorFlow. |
| Experiment Setup | Yes | For the experiments we present in the next section, we collected k = 4000 rollouts for each epoch, and we performed 50 gradient descent steps on Lθ for each policy optimization iteration. For all the environments, the action space describes instantaneous velocities of the handles, which are restricted to remain within physically reasonable limits. We fine tuned hyper parameters to get the best performance we could, and otherwise ran standard implementations provided in Achiam (2018). |
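
The Experiment Setup and Pseudocode rows above describe a loop that collects k = 4000 rollouts per epoch and takes 50 gradient-descent steps on the policy loss Lθ per policy-optimization iteration. The sketch below is only a generic illustration of policy optimization through a differentiable simulator at that scale; it is not the authors' PODS update rule (the paper derives a dedicated second-order update), and the `Policy`, `sim_step`, `reward_fn`, `sample_initial_state`, and `horizon` components are hypothetical placeholders.

```python
# Illustrative sketch only: a generic differentiable-simulation training loop
# at the scale quoted in the Experiment Setup row. Not the authors' code.
import torch
import torch.nn as nn


class Policy(nn.Module):
    """Placeholder deterministic policy mapping states to handle velocities."""

    def __init__(self, state_dim: int, action_dim: int, action_limit: float):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.Tanh(),
            nn.Linear(64, action_dim), nn.Tanh(),
        )
        self.action_limit = action_limit  # keep actions within physical limits

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.action_limit * self.net(state)


def rollout(policy, sim_step, s0, horizon):
    """Unroll a differentiable simulator for one trajectory (sim_step is a placeholder)."""
    states, actions = [s0], []
    s = s0
    for _ in range(horizon):
        a = policy(s)
        s = sim_step(s, a)  # differentiable simulation step
        actions.append(a)
        states.append(s)
    return states, actions


def train(policy, sim_step, sample_initial_state, reward_fn,
          epochs=10, k=4000, horizon=50, opt_steps=50, lr=1e-3):
    """Collect k rollouts per epoch, then take opt_steps gradient steps on L_theta."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        # k = 4000 rollouts per epoch in the paper; horizon here is illustrative.
        initial_states = [sample_initial_state() for _ in range(k)]
        for _ in range(opt_steps):  # 50 gradient steps per policy-optimization iteration
            loss = torch.zeros(())
            for s0 in initial_states:
                states, actions = rollout(policy, sim_step, s0, horizon)
                loss = loss - reward_fn(states, actions)  # L_theta as negative total reward
            opt.zero_grad()
            (loss / k).backward()  # gradients flow back through the simulator steps
            opt.step()
    return policy
```

As written, this first-order loop is closer in spirit to the BPTT baseline the paper compares against than to PODS itself, which the quoted text distinguishes by its second-order update rule; the sketch is included only to make the quoted rollout and gradient-step counts concrete.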