PIPPS: Flexible Model-Based Policy Search Robust to the Curse of Chaos
Authors: Paavo Parmas, Carl Edward Rasmussen, Jan Peters, Kenji Doya
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We performed experiments with two purposes: 1. To explain why RP gradients are not sufficient (Section 4.1). 2. To show that our newly developed methods can match PILCO in terms of learning efficiency (Section 4.2). The results are in Tables 1 and 2, and in Figure 4. |
| Researcher Affiliation | Academia | Okinawa Institute of Science and Technology Graduate University, Okinawa, Japan; University of Cambridge, Cambridge, UK; TU Darmstadt, Darmstadt, Germany; Max Planck Institute for Intelligent Systems, Tübingen, Germany. |
| Pseudocode | Yes | Algorithm 1: Analytic moment matching based trajectory prediction and policy evaluation (used in PILCO); Algorithm 2: Total Propagation Algorithm (used in PIPPS for evaluating the gradient). A simplified sketch of the gradient-combination idea behind Algorithm 2 appears after the table. |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for the methodology described in this paper is publicly available. |
| Open Datasets | Yes | We performed learning tasks from a recent PILCO paper (Deisenroth et al., 2015): cart-pole swing-up and balancing, and unicycle balancing. |
| Dataset Splits | No | The paper describes the learning process, policy evaluations, and trial evaluations but does not provide specific details on dataset splits (e.g., percentages or counts) for training, validation, or testing, nor does it reference predefined splits with citations for dynamic environments. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. It refers to simulations but without hardware specifications. |
| Software Dependencies | No | The paper mentions models and algorithms like 'Gaussian process' and 'RMSprop' but does not specify any software names with version numbers for reproducibility (e.g., Python, PyTorch, TensorFlow, or specific library versions). |
| Experiment Setup | Yes | The optimizer was run for 600 policy evaluations between each trial. The SGD learning rate and momentum parameters were α = 5 × 10⁻⁴ and γ = 0.9. The episode lengths were 3 s for the cart-pole and 2 s for the unicycle. The control frequencies were 10 Hz. The costs were of the type 1 − exp(−(x − t)ᵀ Q (x − t)), where t is the target. The outputs from the policies π(x) were constrained by a saturation function: sat(u) = 9 sin(u)/8 + sin(3u)/8, where u = π(x). One experiment consisted of (1; 5) random trials followed by (15; 30) learned trials for the cart and unicycle tasks respectively. Each experiment was repeated 100 times and averaged. Each trial was evaluated by running the policy 30 times and averaging, though note that this was performed only for evaluation purposes; the algorithms only had access to 1 trial. (A minimal code sketch of the reported cost, saturation, and optimizer constants follows the table.) |
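
The Experiment Setup row quotes several concrete quantities: the learning rate and momentum, the control frequency, the episode lengths, the saturating cost, and the output saturation. Below is a minimal Python sketch of those pieces, assuming placeholder shapes and values for the state `x`, the target `t`, and the weight matrix `Q`; it illustrates the reported formulas and is not the authors' code.

```python
import numpy as np

# Hedged sketch of the reported experiment setup. Q, the target t, and the
# example state below are placeholders, not values given in the paper.

ALPHA = 5e-4                      # SGD learning rate alpha reported in the paper
GAMMA = 0.9                       # momentum parameter gamma reported in the paper
CONTROL_DT = 0.1                  # 10 Hz control frequency
EPISODE_LENGTH_S = {"cartpole": 3.0, "unicycle": 2.0}   # reported episode lengths

def saturating_cost(x, target, Q):
    """Cost of the reported form 1 - exp(-(x - t)^T Q (x - t))."""
    d = x - target
    return 1.0 - np.exp(-d @ Q @ d)

def saturate(u):
    """Output saturation sat(u) = 9 sin(u)/8 + sin(3u)/8, which bounds |sat(u)| <= 1."""
    return 9.0 * np.sin(u) / 8.0 + np.sin(3.0 * u) / 8.0

if __name__ == "__main__":
    # Example usage with placeholder values.
    x = np.array([0.1, -0.2])
    target = np.zeros(2)
    Q = np.eye(2)                 # placeholder weight matrix
    print(saturating_cost(x, target, Q), saturate(2.5))
```
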
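The Pseudocode row names the Total Propagation Algorithm (Algorithm 2) without reproducing it. As a rough, single-step illustration of the underlying idea of combining reparameterization (RP) and likelihood-ratio (LR) gradient estimates by inverse-variance weighting, a sketch follows; the per-particle gradient arrays `g_rp` and `g_lr` and the averaging scheme are assumptions, and the paper's algorithm applies such a combination along full particle trajectories rather than in one shot.

```python
import numpy as np

def combine_gradients(g_rp, g_lr, eps=1e-12):
    """Inverse-variance weighted combination of two gradient estimators.

    g_rp, g_lr: per-particle reparameterization (RP) and likelihood-ratio (LR)
    gradient estimates, each of shape (num_particles, num_params).
    Returns a single combined gradient of shape (num_params,).
    """
    # Empirical variance of each estimator, averaged over parameter dimensions.
    var_rp = np.mean(np.var(g_rp, axis=0)) + eps
    var_lr = np.mean(np.var(g_lr, axis=0)) + eps

    # Weight each estimator inversely proportionally to its variance.
    w_rp = (1.0 / var_rp) / (1.0 / var_rp + 1.0 / var_lr)
    return w_rp * g_rp.mean(axis=0) + (1.0 - w_rp) * g_lr.mean(axis=0)
```
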