Interactive Learning from Policy-Dependent Human Feedback

Authors: James MacGlashan, Mark K. Ho, Robert Loftin, Bei Peng, Guan Wang, David L. Roberts, Matthew E. Taylor, Michael L. Littman

ICML 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 'We present empirical results that show this assumption to be false: whether human trainers give a positive or negative feedback for a decision is influenced by the learner's current policy. Finally, we demonstrate that COACH can successfully learn multiple behaviors on a physical robot.'
Researcher Affiliation | Collaboration | 1 Cogitai, 2 Brown University, 3 North Carolina State University, 4 Washington State University. Correspondence to: James MacGlashan <james@cogitai.com>.
Pseudocode | Yes | The paper provides Algorithm 1 (Real-time COACH); a hedged sketch of its update appears after the table.
Open Source Code | No | The paper states 'COACH was able to successfully learn all five behaviors and a video showing its learning is available online at https://vid.me/3h2s.' This link points to a video, not source code for the methodology. No other statements or links for code release are provided.
Open Datasets | No | The paper uses a human-subject experiment (AMT participants teaching a virtual dog) and a custom grid world simulation for its experiments, neither of which is a named public dataset with access information. It also mentions training on a TurtleBot robot. No publicly available datasets are used or provided.
Dataset Splits | No | The paper mentions a 'training phase' in the human-subject experiment and uses 'train' in the context of learning algorithms, but it does not provide explicit details on dataset splits (e.g., percentages, sample counts) for training, validation, or test sets. The experiments are either human-interactive or simulated environments.
Hardware Specification | No | The paper mentions training 'on a TurtleBot robot' and that 'The TurtleBot is a mobile base with two degrees of freedom that senses the world from a Kinect camera.' These are general product names; specific hardware details such as CPU or GPU models or memory specifications are not provided.
Software Dependencies | No | The paper describes the algorithms and their parameters but does not list specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x, specific libraries).
Experiment Setup | Yes | 'For COACH parameters, we used a softmax-parameterized policy, where each action preference value was a linear function of the image features, plus tanh(θ_a), where θ_a is a learnable parameter for action a, providing a preference in the absence of any stimulus. We used two eligibility traces with λ = 0.95 for feedback +1 and -1, and λ = 0.9999 for feedback +4. The feedback-action delay d was set to 6, which is 0.198 seconds. For Q-learning, we used discount factor γ = 0.99 and learning rate α = 0.2. For TAMER, we used α = 0.2. For COACH in the grid world, we used β = 1 and α = 0.05.'
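
To make the pseudocode and parameter settings summarized above concrete, here is a minimal Python/NumPy sketch of the Real-time COACH update (Algorithm 1): a softmax policy over linear action preferences, per-feedback-type eligibility traces decayed by λ, and a parameter update that scales the selected trace by delayed human feedback. The class and method names and the feature handling are illustrative assumptions rather than the authors' implementation; only the reported values (λ = 0.95 for ±1 feedback, λ = 0.9999 for +4, delay d = 6, α = 0.05) come from the Experiment Setup row, and the tanh(θ_a) preference bias used on the robot is omitted for brevity.

```python
import numpy as np

class RealTimeCOACH:
    """Minimal sketch of Real-time COACH (Algorithm 1): policy-gradient-style
    updates driven by delayed human feedback, with one eligibility trace per
    feedback type. Hyperparameter defaults follow the Experiment Setup row;
    everything else is an illustrative assumption."""

    def __init__(self, n_features, n_actions, alpha=0.05,
                 lambdas=(0.95, 0.9999), delay=6):
        self.theta = np.zeros((n_actions, n_features))  # linear action preferences
        self.alpha = alpha                              # learning rate (0.05 in the grid world)
        self.delay = delay                              # feedback-action delay d = 6 steps
        # one eligibility trace per lambda: 0.95 for +/-1 feedback, 0.9999 for +4
        self.traces = {lam: np.zeros_like(self.theta) for lam in lambdas}
        self.history = []                               # (features, action) buffer for the delay

    def policy(self, x):
        prefs = self.theta @ x                          # softmax over linear preferences
        p = np.exp(prefs - prefs.max())
        return p / p.sum()

    def grad_log_pi(self, x, a):
        # gradient of log softmax: (1[b == a] - pi(b|x)) * x for each action row b
        pi = self.policy(x)
        g = -np.outer(pi, x)
        g[a] += x
        return g

    def step(self, x, feedback=0.0, lam=0.95):
        """One interaction step: sample an action, decay/accumulate all traces
        with the score of the action taken `delay` steps earlier, then update
        theta with the trainer-selected trace scaled by the (possibly zero)
        human feedback."""
        pi = self.policy(x)
        a = int(np.random.choice(len(pi), p=pi))
        self.history.append((x, a))
        if len(self.history) > self.delay:
            xd, ad = self.history[-1 - self.delay]      # delayed state-action pair
            g = self.grad_log_pi(xd, ad)
            for l in self.traces:                       # e_l <- l * e_l + grad log pi
                self.traces[l] = l * self.traces[l] + g
            self.theta += self.alpha * feedback * self.traces[lam]
        return a
```

In use, one would construct `RealTimeCOACH(n_features, n_actions)` and call `step(features, feedback, lam)` once per control cycle, passing the trainer's (possibly zero) feedback and the trace parameter matching its type.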