Interactive Learning from Policy-Dependent Human Feedback

Authors: James MacGlashan, Mark K. Ho, Robert Loftin, Bei Peng, Guan Wang, David L. Roberts, Matthew E. Taylor, Michael L. Littman

ICML 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 'We present empirical results that show this assumption to be false: whether human trainers give a positive or negative feedback for a decision is influenced by the learner's current policy. Finally, we demonstrate that COACH can successfully learn multiple behaviors on a physical robot.'
Researcher Affiliation | Collaboration | 1 Cogitai, 2 Brown University, 3 North Carolina State University, 4 Washington State University. Correspondence to: James MacGlashan <james@cogitai.com>.
Pseudocode | Yes | The paper provides Algorithm 1 (Real-time COACH); a hedged sketch of its update appears after the table.
Open Source Code | No | The paper states 'COACH was able to successfully learn all five behaviors and a video showing its learning is available online at https://vid.me/3h2s.' This link points to a video, not source code for the methodology. No other statements or links for code release are provided.
Open Datasets | No | The paper uses a human-subject experiment (AMT participants teaching a virtual dog) and a custom grid world simulation for its experiments, neither of which is a named public dataset with access information. It also mentions training on a TurtleBot robot. No publicly available datasets are used or provided.
Dataset Splits | No | The paper mentions a 'training phase' in the human-subject experiment and uses 'train' in the context of learning algorithms, but it does not provide explicit details on dataset splits (e.g., percentages, sample counts) for training, validation, or test sets. The experiments are either human-interactive or simulated environments.
Hardware Specification | No | The paper mentions training 'on a TurtleBot robot' and that 'The TurtleBot is a mobile base with two degrees of freedom that senses the world from a Kinect camera.' These are general product names; specific hardware details such as CPU or GPU models or memory specifications are not provided.
Software Dependencies | No | The paper describes the algorithms and their parameters but does not list specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x, specific libraries).
Experiment Setup | Yes | 'For COACH parameters, we used a softmax-parameterized policy, where each action preference value was a linear function of the image features, plus tanh(θ_a), where θ_a is a learnable parameter for action a, providing a preference in the absence of any stimulus. We used two eligibility traces with λ = 0.95 for feedback +1 and -1, and λ = 0.9999 for feedback +4. The feedback-action delay d was set to 6, which is 0.198 seconds. For Q-learning, we used discount factor γ = 0.99 and learning rate α = 0.2. For TAMER, we used α = 0.2. For COACH in the grid world, we used β = 1 and α = 0.05.'
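
To make the pseudocode and parameter settings summarized above concrete, here is a minimal Python/NumPy sketch of the Real-time COACH update (Algorithm 1): a softmax policy over linear action preferences, per-feedback-type eligibility traces decayed by λ, and a parameter update that scales the selected trace by delayed human feedback. The class and method names and the feature handling are illustrative assumptions rather than the authors' implementation; only the reported values (λ = 0.95 for ±1 feedback, λ = 0.9999 for +4, delay d = 6, α = 0.05) come from the Experiment Setup row, and the tanh(θ_a) preference bias used on the robot is omitted for brevity.

```python
import numpy as np

class RealTimeCOACH:
    """Minimal sketch of Real-time COACH (Algorithm 1): policy-gradient-style
    updates driven by delayed human feedback, with one eligibility trace per
    feedback type. Hyperparameter defaults follow the Experiment Setup row;
    everything else is an illustrative assumption."""

    def __init__(self, n_features, n_actions, alpha=0.05,
                 lambdas=(0.95, 0.9999), delay=6):
        self.theta = np.zeros((n_actions, n_features))  # linear action preferences
        self.alpha = alpha                              # learning rate (0.05 in the grid world)
        self.delay = delay                              # feedback-action delay d = 6 steps
        # one eligibility trace per lambda: 0.95 for +/-1 feedback, 0.9999 for +4
        self.traces = {lam: np.zeros_like(self.theta) for lam in lambdas}
        self.history = []                               # (features, action) buffer for the delay

    def policy(self, x):
        prefs = self.theta @ x                          # softmax over linear preferences
        p = np.exp(prefs - prefs.max())
        return p / p.sum()

    def grad_log_pi(self, x, a):
        # gradient of log softmax: (1[b == a] - pi(b|x)) * x for each action row b
        pi = self.policy(x)
        g = -np.outer(pi, x)
        g[a] += x
        return g

    def step(self, x, feedback=0.0, lam=0.95):
        """One interaction step: sample an action, decay/accumulate all traces
        with the score of the action taken `delay` steps earlier, then update
        theta with the trainer-selected trace scaled by the (possibly zero)
        human feedback."""
        pi = self.policy(x)
        a = int(np.random.choice(len(pi), p=pi))
        self.history.append((x, a))
        if len(self.history) > self.delay:
            xd, ad = self.history[-1 - self.delay]      # delayed state-action pair
            g = self.grad_log_pi(xd, ad)
            for l in self.traces:                       # e_l <- l * e_l + grad log pi
                self.traces[l] = l * self.traces[l] + g
            self.theta += self.alpha * feedback * self.traces[lam]
        return a
```

In use, one would construct `RealTimeCOACH(n_features, n_actions)` and call `step(features, feedback, lam)` once per control cycle, passing the trainer's (possibly zero) feedback and the trace parameter matching its type.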