Policy Search in Reproducing Kernel Hilbert Space

Authors: Ngo Anh Vien, Peter Englert, Marc Toussaint

IJCAI 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | For evaluations, we use three simulated (PO)MDP reinforcement learning tasks, and a simulated PR2's robotic manipulation task. The results demonstrate the effectiveness of the new RKHS policy search framework in comparison to plain RKHS actor-critic, episodic natural actor-critic, plain actor-critic, and PoWER approaches.
Researcher Affiliation | Academia | Ngo Anh Vien, Peter Englert, and Marc Toussaint, Machine Learning and Robotics Lab, University of Stuttgart, Germany. {vien.ngo, peter.englert, marc.toussaint}@ipvs.uni-stuttgart.de
Pseudocode | Yes | Algorithm 1: RKHS Policy Search Framework
Open Source Code | No | The paper does not provide an explicit statement about, or link to, open-source code for the described methodology.
Open Datasets | Yes | We evaluate and compare on four domains (three learning in MDP and one learning in POMDP): a Toy domain, the benchmark Inverted Pendulum domain, a simulated PR2 robotic manipulation task, and the POMDP Inverted Pendulum. ... The Toy domain was introduced by Lever and Stafford [Lever and Stafford, 2015]. It has a state space S ∈ [−4.0; 4.0], an action space A ∈ [−1.0; 1.0], the starting state s0 = 0, and a reward function r(s, a) = exp(−|s − 3|). The dynamics is s_{t+1} = s_t + a_t + ε, where ε is a small Gaussian noise. (A minimal sketch of this environment appears after the table.)
Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits, specific percentages, or absolute sample counts for partitioning the data.
Hardware Specification | No | The paper does not provide specific details regarding the hardware used for running the experiments.
Software Dependencies | No | The paper mentions using 'the physics simulation ODE internally' (the Open Dynamics Engine) but does not provide specific version numbers for this or any other software dependencies.
Experiment Setup | Yes | All methods use a discount factor γ = 0.99 for learning and report the cumulative discounted reward. ... The bandwidth of RBF kernels is chosen using the median trick. ... We set N = 10, H = 20. All controllers use 20 centres. ... The results are computed over 50 runs ... We use N = 10, H = 100 whose optimal return is roughly 46. We use 40 RBF centres in all controllers. ... The averaged performance is obtained over 15 runs. ... We use 150 centres for all algorithms and line-search over a grid of 50 step-sizes, and the length 20 for a history state. ... We set N = 10, H = 1. ... Figure 5 reports the average performance over 5 runs with a fixed step-size of 0.005 for RKHS-AC and RKHS-NAC. Each run starts from different initial poses of the robot. (A sketch of the median trick and an RBF-centre controller appears after the table, below the environment sketch.)
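
The Toy domain quoted under Open Datasets is fully specified by the excerpt except for the noise level and whether states are clipped to S. The Python sketch below reconstructs it under the assumption of a Gaussian noise standard deviation of 0.1 and clipping to the stated ranges; the rollout helper also illustrates the cumulative discounted return with γ = 0.99 and H = 20 reported under Experiment Setup.

```python
import numpy as np

class ToyDomain:
    """Toy domain of Lever and Stafford (2015) as quoted in the paper:
    S in [-4.0, 4.0], A in [-1.0, 1.0], s0 = 0, r(s, a) = exp(-|s - 3|),
    dynamics s_{t+1} = s_t + a_t + eps with small Gaussian noise eps."""

    def __init__(self, noise_std=0.1, seed=None):
        # noise_std is an assumption; the excerpt only says "small Gaussian noise".
        self.noise_std = noise_std
        self.rng = np.random.default_rng(seed)
        self.state = 0.0

    def reset(self):
        self.state = 0.0
        return self.state

    def step(self, action):
        action = float(np.clip(action, -1.0, 1.0))          # A = [-1, 1]
        reward = float(np.exp(-abs(self.state - 3.0)))      # r(s, a) = exp(-|s - 3|)
        noise = self.rng.normal(0.0, self.noise_std)
        self.state = float(np.clip(self.state + action + noise, -4.0, 4.0))  # S = [-4, 4]
        return self.state, reward


def discounted_return(env, policy, horizon=20, gamma=0.99):
    """Cumulative discounted reward of one rollout (H = 20, gamma = 0.99 for the Toy domain)."""
    state, ret = env.reset(), 0.0
    for t in range(horizon):
        state, reward = env.step(policy(state))
        ret += gamma ** t * reward
    return ret
```

With N = 10 such rollouts per update, averaging discounted_return over the rollouts gives the performance measure the paper reports for this domain.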
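
The Experiment Setup row mentions two standard ingredients that the excerpt does not spell out: choosing RBF bandwidths with the median trick and representing every controller by a fixed set of RBF centres (20 for the Toy domain). The sketch below shows one common reading of both; it is not the paper's Algorithm 1, and the linearly spaced centres, the sampled states, and the weight vector (which the policy-search method would adapt) are illustrative assumptions.

```python
import numpy as np

def median_trick_bandwidth(states):
    """Median trick: set the RBF bandwidth to the median pairwise distance between states."""
    states = np.atleast_2d(np.asarray(states, dtype=float))
    if states.shape[0] == 1:                 # a flat list of scalar states -> column vector
        states = states.T
    diffs = states[:, None, :] - states[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    return float(np.median(dists[np.triu_indices(len(states), k=1)]))

def rbf_features(state, centres, bandwidth):
    """Gaussian RBF features of a state evaluated at each centre."""
    state = np.atleast_1d(np.asarray(state, dtype=float))
    sq_dist = ((centres - state) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))

# Illustrative controller for the Toy domain: 20 centres spanning S = [-4, 4].
rng = np.random.default_rng(0)
centres = np.linspace(-4.0, 4.0, 20).reshape(-1, 1)
visited_states = rng.uniform(-4.0, 4.0, size=(200, 1))   # states gathered from rollouts
bandwidth = median_trick_bandwidth(visited_states)
weights = rng.normal(0.0, 0.1, size=20)                   # adapted by the policy-search update

def policy(state):
    """Mean action as a weighted sum of RBF features, clipped to A = [-1, 1]."""
    return float(np.clip(weights @ rbf_features(state, centres, bandwidth), -1.0, 1.0))
```

For instance, discounted_return(ToyDomain(seed=0), policy) from the previous sketch evaluates this controller for one episode.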