PIPPS: Flexible Model-Based Policy Search Robust to the Curse of Chaos

Authors: Paavo Parmas, Carl Edward Rasmussen, Jan Peters, Kenji Doya

ICML 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We performed experiments with two purposes: 1. To explain why RP gradients are not sufficient (Section 4.1). 2. To show that our newly developed methods can match up to PILCO in terms of learning efficiency (Section 4.2). The results are in Tables 1 and 2, and in Figure 4.
Researcher Affiliation | Academia | 1 Okinawa Institute of Science and Technology Graduate University, Okinawa, Japan; 2 University of Cambridge, Cambridge, UK; 3 TU Darmstadt, Darmstadt, Germany; 4 Max Planck Institute for Intelligent Systems, Tübingen, Germany.
Pseudocode | Yes | Algorithm 1: Analytic moment matching based trajectory prediction and policy evaluation (used in PILCO). Algorithm 2: Total Propagation Algorithm (used in PIPPS for evaluating the gradient).
Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for the methodology described in this paper is publicly available.
Open Datasets | Yes | We performed learning tasks from a recent PILCO paper (Deisenroth et al., 2015): cart-pole swing-up and balancing, and unicycle balancing.
Dataset Splits | No | The paper describes the learning process, policy evaluations, and trial evaluations but does not provide specific details on dataset splits (e.g., percentages or counts) for training, validation, or testing, nor does it reference predefined splits with citations for dynamic environments.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. It refers to simulations but gives no hardware specifications.
Software Dependencies | No | The paper mentions models and algorithms such as Gaussian processes and RMSprop but does not specify any software names with version numbers for reproducibility (e.g., Python, PyTorch, TensorFlow, or specific library versions).
Experiment Setup | Yes | The optimizer was run for 600 policy evaluations between each trial. The SGD learning rate and momentum parameters were α = 5 × 10⁻⁴ and γ = 0.9. The episode lengths were 3 s for the cart-pole and 2 s for the unicycle, and the control frequencies were 10 Hz. The costs were of the type 1 − exp(−(x − t)ᵀ Q (x − t)), where t is the target. The outputs from the policies π(x) were constrained by a saturation function sat(u) = 9 sin(u)/8 + sin(3u)/8, where u = π(x). One experiment consisted of (1; 5) random trials followed by (15; 30) learned trials for the cart-pole and unicycle tasks, respectively. Each experiment was repeated 100 times and averaged. Each trial was evaluated by running the policy 30 times and averaging, though note that this was performed only for evaluation purposes; the algorithms only had access to 1 trial.
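
To make the quoted setup easier to parse, here is a minimal Python sketch of the reported hyperparameters, the saturating cost, and the action-saturation function. The names (CONFIG, saturating_cost, saturate_action) are illustrative placeholders, not identifiers from the paper or its (unreleased) code; only the numerical values and formulas come from the quoted text.

```python
import numpy as np

# Illustrative sketch of the experiment settings quoted above.
# CONFIG, saturating_cost, and saturate_action are placeholder names.

CONFIG = {
    "policy_evals_between_trials": 600,
    "sgd_learning_rate": 5e-4,          # alpha
    "sgd_momentum": 0.9,                # gamma
    "episode_length_s": {"cartpole": 3.0, "unicycle": 2.0},
    "control_frequency_hz": 10,
    "random_trials": {"cartpole": 1, "unicycle": 5},
    "learned_trials": {"cartpole": 15, "unicycle": 30},
    "experiment_repeats": 100,
    "eval_rollouts_per_trial": 30,      # used for evaluation only
}

def saturating_cost(x, target, Q):
    """Cost of the form 1 - exp(-(x - t)^T Q (x - t)); saturates at 1 far from the target."""
    d = np.asarray(x) - np.asarray(target)
    return 1.0 - np.exp(-d @ Q @ d)

def saturate_action(u):
    """Bound the raw policy output u = pi(x) via sat(u) = 9 sin(u)/8 + sin(3u)/8."""
    u = np.asarray(u)
    return 9.0 * np.sin(u) / 8.0 + np.sin(3.0 * u) / 8.0
```

As a quick check, saturating_cost(np.zeros(2), np.ones(2), np.eye(2)) evaluates to 1 − exp(−2) ≈ 0.865, and saturate_action keeps the control signal bounded regardless of the magnitude of the raw policy output.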