Data-Efficient Reinforcement Learning in Continuous State-Action Gaussian-POMDPs

Authors: Rowan McAllister, Carl Edward Rasmussen

NeurIPS 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We test our method on the cartpole swing-up task, which involves nonlinear dynamics and requires nonlinear control. In the final sections we show experimental results of our proposed algorithm handling observation noise better than competing algorithms.
Researcher Affiliation | Academia | Rowan Thomas McAllister, Department of Engineering, Cambridge University, Cambridge, CB2 1PZ, rtm26@cam.ac.uk; Carl Edward Rasmussen, Department of Engineering, University of Cambridge, Cambridge, CB2 1PZ, cer54@cam.ac.uk
Pseudocode | Yes | Algorithm 1 PILCO (a hedged, runnable skeleton of this loop is sketched below, after the table)
Open Source Code | No | However, we modify (using PILCO’s source code http://mlg.eng.cam.ac.uk/pilco/) two subroutines to extend PILCO... (This link refers to PILCO's original source code, not the modified code developed by the authors for this paper.)
Open Datasets | No | We test our algorithm on the cartpole swing-up problem (shown in Appendix A), a benchmark for comparing controllers of nonlinear dynamical systems. We generate a single dataset by running the baseline PILCO algorithm for 11 episodes (totalling 22 seconds of system interaction). (The paper uses a physics simulator to generate its own data, but does not provide access information for a publicly available or open dataset.)
Dataset Splits | No | No explicit mention of validation dataset splits was found.
Hardware Specification | No | The paper specifies parameters of the simulated physical system (e.g., 'cart mass of mc = 0.5kg', 'pole of length l = 0.2m'), but does not provide specific details about the computer hardware (e.g., GPU/CPU models, memory) used for running the simulations or training the models.
Software Dependencies | No | The paper mentions various methods and models (e.g., 'Gaussian Processes', 'PILCO') and cites related work, but does not provide specific version numbers for any software dependencies or libraries used in the implementation.
Experiment Setup | Yes | The cartpole swing-up problem... We experiment using a physics simulator... We use a cart mass of m_c = 0.5 kg. A zero-order-hold controller applies horizontal forces to the cart within the range [-10, 10] N. The policy is a linear combination of 100 radial basis functions. Friction resists the cart's motion with damping coefficient b = 0.1 Ns/m. Connected to the cart is a pole of length l = 0.2 m and mass m_p = 0.5 kg located at its endpoint, which swings due to gravity's acceleration g = 9.82 m/s². ...the time discretisation is Δt = 1/30 s. ...We both randomly initialise the system and set the initial belief of the system according to B_{0|-1} ~ N(M_{0|-1}, V_{0|-1}), where M_{0|-1} ~ δ([0, π, 0, 0]^T) and V_{0|-1}^{1/2} = diag([0.2 m, 0.2 rad, 0.2 m/s, 0.2 rad/s]). The camera's noise standard deviation is (Σ_ε)^{1/2} = diag([0.03 m, 0.03 rad, 0.03/Δt m/s, 0.03/Δt rad/s]). Each episode has a two-second time horizon (60 timesteps). The cost function we impose is 1 − exp(−d²/(2σ_c²)), where σ_c = 0.25 m and d² is the squared Euclidean distance between the pendulum's end point and its goal. Training the GP dynamics model involved N = 660 data points, M = 50 inducing points under a sparse GP Fully Independent Training Conditional (FITC) [2], P = 100 policy RBF centroids, D = 4 state dimensions, F = 1 action dimension, and T = 60 timestep horizon. (A simulation sketch using these quoted parameters follows below.)
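For the "Pseudocode" row above: the paper's Algorithm 1 is the episodic PILCO loop (fit a dynamics model to all data collected so far, improve the policy against that model's predictions, execute the policy on the real system, and record the new data). Below is a minimal runnable Python skeleton of that structure. It is a sketch only: the GP dynamics model and PILCO's analytic, gradient-based policy search are replaced by trivial stand-ins (a least-squares linear model and random search on a 1-D toy system), all of which are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

# Skeleton of the episodic PILCO loop (Algorithm 1): fit a dynamics model to
# all data so far, improve the policy against the model, execute, repeat.
# The GP model and gradient-based policy optimisation of real PILCO are
# replaced by trivial stand-ins purely to keep this sketch self-contained.

def rollout(policy_params, horizon, rng):
    """Run a toy 1-D system for `horizon` steps; return (s, a, s') tuples."""
    s, data = 0.0, []
    for _ in range(horizon):
        a = np.clip(policy_params[0] * s + policy_params[1], -1.0, 1.0)
        s_next = 0.9 * s + 0.5 * a + rng.normal(0.0, 0.01)  # "true" dynamics
        data.append((s, a, s_next))
        s = s_next
    return data

def fit_model(data):
    """Stand-in for GP training: least-squares fit of s' ~ [s, a, 1]."""
    X = np.array([[s, a, 1.0] for s, a, _ in data])
    y = np.array([sn for _, _, sn in data])
    return np.linalg.lstsq(X, y, rcond=None)[0]

def predicted_cost(w, policy_params, horizon, target=1.0):
    """Roll the learned model forward; sum squared distance to the target."""
    s, total = 0.0, 0.0
    for _ in range(horizon):
        a = np.clip(policy_params[0] * s + policy_params[1], -1.0, 1.0)
        s = w @ np.array([s, a, 1.0])
        total += (s - target) ** 2
    return total

rng = np.random.default_rng(0)
policy, horizon = np.zeros(2), 30
data = rollout(rng.normal(size=2), horizon, rng)   # seed episode, random policy
for episode in range(10):
    w = fit_model(data)                            # "train dynamics model"
    # "improve policy": random search in place of PILCO's analytic gradients
    candidates = [policy + 0.3 * rng.normal(size=2) for _ in range(200)]
    policy = min(candidates, key=lambda p: predicted_cost(w, p, horizon))
    data += rollout(policy, horizon, rng)          # execute and record data
print("final predicted cost:", predicted_cost(w, policy, horizon))
```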
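For the "Experiment Setup" row: the quoted parameters are sufficient to sketch the simulated system. The following Python sketch reconstructs a cartpole swing-up simulator with those constants, the saturating cost, and the quoted camera noise. The point-mass-on-massless-rod equations of motion, the angle convention (θ = 0 upright, θ = π hanging down), the Euler substepping, and the 0.03/Δt reading of the velocity-noise terms are assumptions, not taken from the authors' code.

```python
import numpy as np

# Reconstruction of the cartpole swing-up setup from the quoted parameters.
# Dynamics assume a point mass m_p on a massless rod of length l, pinned to
# the cart; theta = 0 is upright (goal), theta = pi is hanging down (start).

m_c, m_p, l = 0.5, 0.5, 0.2   # cart mass [kg], pole mass [kg], pole length [m]
b, g = 0.1, 9.82              # cart damping [N s/m], gravity [m/s^2]
dt = 1.0 / 30.0               # control/observation timestep [s]
sigma_c = 0.25                # cost width [m]

def dynamics(state, u):
    """Time derivative of state = [x, theta, x_dot, theta_dot]."""
    x, th, xd, thd = state
    s, c = np.sin(th), np.cos(th)
    xdd = (u - b * xd + m_p * s * (l * thd**2 - g * c)) / (m_c + m_p * s**2)
    thdd = (g * s - xdd * c) / l
    return np.array([xd, thd, xdd, thdd])

def step(state, u, substeps=10):
    """Zero-order hold: apply the clipped force for dt via Euler substeps."""
    u = np.clip(u, -10.0, 10.0)
    h = dt / substeps
    for _ in range(substeps):
        state = state + h * dynamics(state, u)
    return state

def cost(state):
    """Saturating cost 1 - exp(-d^2 / (2 sigma_c^2)), where d is the distance
    of the pole tip from the goal point directly above the cart's origin."""
    x, th = state[0], state[1]
    tip = np.array([x + l * np.sin(th), l * np.cos(th)])
    d2 = np.sum((tip - np.array([0.0, l])) ** 2)
    return 1.0 - np.exp(-0.5 * d2 / sigma_c**2)

def observe(state, rng):
    """Noisy camera observation with the quoted standard deviations
    (the 0.03/dt velocity terms follow the reconstructed reading above)."""
    std = np.array([0.03, 0.03, 0.03 / dt, 0.03 / dt])
    return state + rng.normal(0.0, std)

rng = np.random.default_rng(0)
state = np.array([0.0, np.pi, 0.0, 0.0])   # cart centred, pole hanging down
for t in range(60):                        # two-second episode horizon
    u = rng.uniform(-10.0, 10.0)           # placeholder random policy
    state = step(state, u)
print("final cost:", cost(state), "noisy obs:", observe(state, rng))
```

The saturating cost matters for data efficiency here: unlike a quadratic cost, it plateaus far from the goal, so predicted trajectories with high uncertainty are penalised in a bounded way during policy optimisation.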