Generalizing Policy Advice with Gaussian Process Bandits for Dynamic Skill Improvement
Authors: Jared Glover, Charlotte Zhu
AAAI 2014 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present a ping-pong-playing robot that learns to improve its swings with human advice. Our method learns a reward function over the joint space of task and policy parameters T × P, so the robot can explore policy space more intelligently, trading off exploration vs. exploitation to maximize the total cumulative reward over time. Multimodal stochastic policies can also easily be learned with this approach when the reward function is multimodal in the policy parameters. We extend the recently-developed Gaussian Process Bandit Optimization framework to include exploration-bias advice from human domain experts, using a novel algorithm called Exploration Bias with Directional Advice (EBDA). (A hedged GP-bandit sketch of this kind of exploration-bias advice is given after the table.) |
| Researcher Affiliation | Academia | Jared Glover and Charlotte Zhu, Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology. Email: {jglov,charz}@mit.edu |
| Pseudocode | Yes | Algorithm 1 EBDA |
| Open Source Code | No | No explicit statement about open-sourcing the code for the methodology. |
| Open Datasets | No | The paper describes generating its own data through human interaction with the robot: 'We ran each algorithm for 100 trials of a human ping pong expert hitting balls to the robot in a predetermined sequence... If an incoming ball trajectory from the human differed significantly from the desired ball trajectory, the human was prompted to hit another ball to the robot.' No information about public availability of this dataset is provided. |
| Dataset Splits | No | The paper describes an episodic learning task and experimental trials (e.g., '100 trials of a human ping pong expert hitting balls to the robot'), but it does not specify explicit training, validation, or test dataset splits in the conventional sense of splitting a fixed dataset. The learning is online. |
| Hardware Specification | No | The paper describes the robot hardware (7-dof Barrett WAM arm, Silicon Video SV-640M cameras, Kinect) but does not provide specifications for the computing hardware (e.g., CPU, GPU) used to run the algorithms and experiments. |
| Software Dependencies | No | The paper mentions 'Open NI' for Kinect tracking, but does not specify any version numbers for this or any other software libraries or dependencies used in the implementation of the algorithms. |
| Experiment Setup | Yes | The policy parameter vector for each learning experiment was θ = (x, v_x, v_z, n_y, n_z): a subset of the hit plan parameters. x was defined as a displacement from the default hit plan's x position, and controlled whether the ball was hit before or after its highest point. Policy parameters were bounded from (−.1, 1, 0, −.1, −.5) to (.1, 2, .5, .1, .5). Rewards of 0 or 1 were given by the experimenter depending on whether the ball was successfully hit back to the other side of the table. In addition, a coach's pseudo-reward of 0.5 was added if the coach liked the way the robot hit the ball, for a total reward in each trial between 0 and 1.5. (A small sketch of this reward composition follows the bandit sketch below.) |
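
The method description above refers to Gaussian Process Bandit Optimization with exploration-bias advice from a human coach (EBDA). As a rough illustration of the idea, here is a minimal Python sketch of a GP-UCB-style bandit loop over the five policy parameters, with a simple cosine-similarity "directional advice" bonus added to the acquisition score. The bias term, the acquisition rule, and all names (`advice_bias`, `select_policy`, `POLICY_LOWER`/`POLICY_UPPER`) are illustrative assumptions, not the paper's actual Algorithm 1 (EBDA).

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Policy bounds for theta = (x, vx, vz, ny, nz), taken from the experiment setup above.
POLICY_LOWER = np.array([-0.1, 1.0, 0.0, -0.1, -0.5])
POLICY_UPPER = np.array([0.1, 2.0, 0.5, 0.1, 0.5])

def advice_bias(candidates, theta_prev, direction, strength=0.5):
    """Extra acquisition credit for candidates that move from the last policy
    in the direction the coach suggested (cosine similarity, clipped at 0)."""
    delta = candidates - theta_prev
    norms = np.linalg.norm(delta, axis=1) + 1e-9
    cosine = (delta @ direction) / (norms * np.linalg.norm(direction))
    return strength * np.clip(cosine, 0.0, None)

def select_policy(gp, thetas, direction=None, n_candidates=2000, beta=2.0):
    """GP-UCB acquisition over random candidate policies, optionally biased by advice."""
    candidates = np.random.uniform(POLICY_LOWER, POLICY_UPPER, size=(n_candidates, 5))
    mu, sigma = gp.predict(candidates, return_std=True)  # prior is used before any fit
    score = mu + beta * sigma
    if direction is not None and thetas:
        score = score + advice_bias(candidates, thetas[-1], direction)
    return candidates[np.argmax(score)]

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), alpha=1e-3)
thetas, rewards = [], []
coach_direction = np.array([0.0, 1.0, 0.0, 0.0, 0.0])  # e.g. advice to swing faster in x
for trial in range(10):
    theta = select_policy(gp, thetas, direction=coach_direction)
    reward = np.random.rand()  # placeholder; the paper uses hit success + coach pseudo-reward
    thetas.append(theta)
    rewards.append(reward)
    gp.fit(np.array(thetas), np.array(rewards))
```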
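
The experiment-setup row describes a per-trial reward composed of a 0/1 return-success signal plus an optional 0.5 coach pseudo-reward, for a total between 0 and 1.5. A minimal sketch of that composition (the function name is hypothetical):

```python
def trial_reward(ball_returned: bool, coach_approved: bool) -> float:
    """Per-trial reward: 1 for returning the ball, plus a 0.5 coach pseudo-reward."""
    reward = 1.0 if ball_returned else 0.0
    if coach_approved:
        reward += 0.5
    return reward

assert trial_reward(True, True) == 1.5    # best case
assert trial_reward(False, False) == 0.0  # worst case
```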