Generalizing Policy Advice with Gaussian Process Bandits for Dynamic Skill Improvement
Authors: Jared Glover, Charlotte Zhu
AAAI 2014 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present a ping-pong-playing robot that learns to improve its swings with human advice. Our method learns a reward function over the joint space of task and policy parameters T × P, so the robot can explore policy space more intelligently, trading off exploration vs. exploitation to maximize the total cumulative reward over time. Multimodal stochastic policies can also easily be learned with this approach when the reward function is multimodal in the policy parameters. We extend the recently-developed Gaussian Process Bandit Optimization framework to include exploration-bias advice from human domain experts, using a novel algorithm called Exploration Bias with Directional Advice (EBDA). (A hedged GP-bandit sketch of this kind of exploration-bias advice is given after the table.) |
| Researcher Affiliation | Academia | Jared Glover and Charlotte Zhu, Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology. Email: {jglov,charz}@mit.edu |
| Pseudocode | Yes | Algorithm 1 EBDA |
| Open Source Code | No | No explicit statement about open-sourcing the code for the methodology. |
| Open Datasets | No | The paper describes generating its own data through human interaction with the robot: 'We ran each algorithm for 100 trials of a human ping pong expert hitting balls to the robot in a predetermined sequence... If an incoming ball trajectory from the human differed significantly from the desired ball trajectory, the human was prompted to hit another ball to the robot.' No information about public availability of this dataset is provided. |
| Dataset Splits | No | The paper describes an episodic learning task and experimental trials (e.g., '100 trials of a human ping pong expert hitting balls to the robot'), but it does not specify explicit training, validation, or test dataset splits in the conventional sense of splitting a fixed dataset. The learning is online. |
| Hardware Specification | No | The paper describes the robot hardware (7-dof Barrett WAM arm, Silicon Video SV-640M cameras, Kinect) but does not provide specifications for the computing hardware (e.g., CPU, GPU) used to run the algorithms and experiments. |
| Software Dependencies | No | The paper mentions 'Open NI' for Kinect tracking, but does not specify any version numbers for this or any other software libraries or dependencies used in the implementation of the algorithms. |
| Experiment Setup | Yes | The policy parameter vector for each learning experiment was θ = (x, v_x, v_z, n_y, n_z): a subset of the hit plan parameters. x was defined as a displacement from the default hit plan's x position, and controlled whether the ball was hit before or after its highest point. Policy parameters were bounded from (−.1, 1, 0, −.1, −.5) to (.1, 2, .5, .1, .5). Rewards of 0 or 1 were given by the experimenter depending on whether the ball was successfully hit back to the other side of the table. In addition, a coach's pseudo-reward of 0.5 was added if the coach liked the way the robot hit the ball, for a total reward in each trial between 0 and 1.5. (A small sketch of this reward composition follows the bandit sketch below.) |
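
The method description above refers to Gaussian Process Bandit Optimization with exploration-bias advice from a human coach (EBDA). As a rough illustration of the idea, here is a minimal Python sketch of a GP-UCB-style bandit loop over the five policy parameters, with a simple cosine-similarity "directional advice" bonus added to the acquisition score. The bias term, the acquisition rule, and all names (`advice_bias`, `select_policy`, `POLICY_LOWER`/`POLICY_UPPER`) are illustrative assumptions, not the paper's actual Algorithm 1 (EBDA).

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Policy bounds for theta = (x, vx, vz, ny, nz), taken from the experiment setup above.
POLICY_LOWER = np.array([-0.1, 1.0, 0.0, -0.1, -0.5])
POLICY_UPPER = np.array([0.1, 2.0, 0.5, 0.1, 0.5])

def advice_bias(candidates, theta_prev, direction, strength=0.5):
    """Extra acquisition credit for candidates that move from the last policy
    in the direction the coach suggested (cosine similarity, clipped at 0)."""
    delta = candidates - theta_prev
    norms = np.linalg.norm(delta, axis=1) + 1e-9
    cosine = (delta @ direction) / (norms * np.linalg.norm(direction))
    return strength * np.clip(cosine, 0.0, None)

def select_policy(gp, thetas, direction=None, n_candidates=2000, beta=2.0):
    """GP-UCB acquisition over random candidate policies, optionally biased by advice."""
    candidates = np.random.uniform(POLICY_LOWER, POLICY_UPPER, size=(n_candidates, 5))
    mu, sigma = gp.predict(candidates, return_std=True)  # prior is used before any fit
    score = mu + beta * sigma
    if direction is not None and thetas:
        score = score + advice_bias(candidates, thetas[-1], direction)
    return candidates[np.argmax(score)]

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), alpha=1e-3)
thetas, rewards = [], []
coach_direction = np.array([0.0, 1.0, 0.0, 0.0, 0.0])  # e.g. advice to swing faster in x
for trial in range(10):
    theta = select_policy(gp, thetas, direction=coach_direction)
    reward = np.random.rand()  # placeholder; the paper uses hit success + coach pseudo-reward
    thetas.append(theta)
    rewards.append(reward)
    gp.fit(np.array(thetas), np.array(rewards))
```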
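
The experiment-setup row describes a per-trial reward composed of a 0/1 return-success signal plus an optional 0.5 coach pseudo-reward, for a total between 0 and 1.5. A minimal sketch of that composition (the function name is hypothetical):

```python
def trial_reward(ball_returned: bool, coach_approved: bool) -> float:
    """Per-trial reward: 1 for returning the ball, plus a 0.5 coach pseudo-reward."""
    reward = 1.0 if ball_returned else 0.0
    if coach_approved:
        reward += 0.5
    return reward

assert trial_reward(True, True) == 1.5    # best case
assert trial_reward(False, False) == 0.0  # worst case
```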