Policy Search with High-Dimensional Context Variables
Authors: Voot Tangkaratt, Herke van Hoof, Simone Parisi, Gerhard Neumann, Jan Peters, Masashi Sugiyama
AAAI 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the proposed method on three problems. We start by studying C-MORE's behavior in a scenario where we know the true reward model and the true low-dimensional context. Subsequently, we focus our attention on two simulated robotic ball hitting tasks. In the first task, a toy 2-DoF planar robot arm has to hit a ball placed on a plane. In the second task, a simulated 6-DoF robot arm has to hit a ball placed in a three-dimensional space. |
| Researcher Affiliation | Academia | Voot Tangkaratt (The University of Tokyo, 113-0033 Tokyo, Japan; voot@ms.k.u-tokyo.ac.jp); Herke van Hoof (McGill University, 3480 Rue University, Montreal, Canada; Technical University of Darmstadt, 64289 Darmstadt, Germany); Simone Parisi (Technical University of Darmstadt, 64289 Darmstadt, Germany; simone@robot-learning.de); Gerhard Neumann (University of Lincoln, LN6 7TS Lincoln, United Kingdom; Technical University of Darmstadt, 64289 Darmstadt, Germany; geri@robot-learning.de); Jan Peters (MPI for Intelligent Systems, 72076 Tuebingen, Germany; Technical University of Darmstadt, 64289 Darmstadt, Germany; mail@jan-peters.net); Masashi Sugiyama (The University of Tokyo, 277-8561 Chiba, Japan; RIKEN AIP Center, 351-0198 Saitama, Japan; sugi@k.u-tokyo.ac.jp) |
| Pseudocode | Yes | Algorithm 1: C-MORE |
| Open Source Code | No | The paper does not provide any specific links or statements about the availability of its source code. |
| Open Datasets | No | The paper uses a "synthetic task with known ground truth" and "robotic ball hitting tasks based on camera images" where the images were collected or generated by the authors. No concrete access information (link, DOI, formal citation to a public dataset) is provided for these datasets. |
| Dataset Splits | Yes | For C-MORE Nuc. Norm, C-MORE LASSO and C-MORE PCA, we perform 5-fold cross-validation every 100 policy updates to choose the values of the nuclear-norm regularization parameter, the ℓ1-norm regularization parameter, and the dimension dz, respectively. (A hedged code sketch of this selection loop appears after the table.) |
| Hardware Specification | No | The paper describes simulated robot arms and tasks but does not specify the hardware (e.g., CPU, GPU models) on which these simulations were run. |
| Software Dependencies | No | The paper mentions software like IPOPT and APG but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | We set γ = 0.99 and H0 = 150. The sampling Gaussian distribution is initialized with a random mean and covariance Q = 10,000I. For learning, we collect 35 new samples per iteration and keep track of the samples collected during the last 20 iterations to stabilize the policy update. Learning is performed for a maximum of 100 iterations. If the KL divergence is lower than 0.1, learning is considered converged and the policy is not updated anymore. (A minimal sketch of this loop appears after the table.) |
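
The 5-fold cross-validation reported under Dataset Splits can be summarized in a short sketch. This is a hedged illustration, not the authors' code: the candidate grid and the helper callables `fit_reward_model` and `validation_error` are assumptions standing in for C-MORE's reward-model fit and its held-out error, neither of which is specified in the quoted text.

```python
# Hypothetical sketch of the 5-fold cross-validation described in the paper:
# every 100 policy updates, candidate values for a hyperparameter (nuclear-norm
# coefficient, l1 coefficient, or dimension d_z) are scored on held-out folds.
import numpy as np
from sklearn.model_selection import KFold

def select_hyperparameter(samples, rewards, candidates,
                          fit_reward_model, validation_error):
    """Return the candidate with the lowest mean validation error over 5 folds."""
    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    best_value, best_score = None, np.inf
    for value in candidates:
        fold_scores = []
        for train_idx, val_idx in kf.split(samples):
            # Fit the reward model on the training folds with this candidate value
            model = fit_reward_model(samples[train_idx], rewards[train_idx], value)
            # Score it on the held-out fold
            fold_scores.append(
                validation_error(model, samples[val_idx], rewards[val_idx]))
        score = np.mean(fold_scores)
        if score < best_score:
            best_value, best_score = value, score
    return best_value
```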
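The Experiment Setup row likewise maps onto a compact outer loop. Below is a minimal sketch assuming Python with NumPy; `evaluate`, `c_more_update`, and `gaussian_kl` are hypothetical placeholders for the task reward, the C-MORE policy update, and the Gaussian KL divergence, none of which are given in the quoted text.

```python
# Minimal sketch of the reported experiment loop; helper callables are hypothetical.
import numpy as np
from collections import deque

def run_learning(evaluate, c_more_update, gaussian_kl, dim, max_iters=100):
    gamma, h0 = 0.99, 150              # entropy-bound parameters from the setup
    mean = np.random.randn(dim)        # random initial mean
    cov = 10_000 * np.eye(dim)         # initial covariance Q = 10,000 I
    buffer = deque(maxlen=20)          # samples from the last 20 iterations
    for _ in range(max_iters):
        # Collect 35 new samples from the current Gaussian sampling distribution
        samples = np.random.multivariate_normal(mean, cov, size=35)
        rewards = np.array([evaluate(s) for s in samples])
        buffer.append((samples, rewards))
        new_mean, new_cov = c_more_update(buffer, mean, cov, gamma, h0)
        kl = gaussian_kl(new_mean, new_cov, mean, cov)
        mean, cov = new_mean, new_cov
        if kl < 0.1:                   # KL below 0.1: considered converged
            break
    return mean, cov
```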