Interactive Learning from Policy-Dependent Human Feedback
Authors: James MacGlashan, Mark K. Ho, Robert Loftin, Bei Peng, Guan Wang, David L. Roberts, Matthew E. Taylor, Michael L. Littman
ICML 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present empirical results that show this assumption to be false: whether human trainers give a positive or negative feedback for a decision is influenced by the learner's current policy. Finally, we demonstrate that COACH can successfully learn multiple behaviors on a physical robot. |
| Researcher Affiliation | Collaboration | 1Cogitai, 2Brown University, 3North Carolina State University, 4Washington State University. Correspondence to: James MacGlashan <james@cogitai.com>. |
| Pseudocode | Yes | Algorithm 1 Real-time COACH |
| Open Source Code | No | The paper states 'COACH was able to successfully learn all five behaviors and a video showing its learning is available online at https://vid.me/3h2s.' This link points to a video, not source code for the methodology. No other statements or links for code release are provided. |
| Open Datasets | No | The paper uses a human-subject experiment (AMT participants teaching a virtual dog) and a custom grid world simulation for its experiments, which are not named public datasets with access information. It also mentions training on a Turtle Bot robot. No publicly available datasets are used or provided. |
| Dataset Splits | No | The paper mentions a 'training phase' in the human-subject experiment and uses 'train' in the context of learning algorithms, but it does not provide explicit dataset splits (e.g., percentages, sample counts) for training, validation, or test sets. The experiments are conducted either interactively with human participants or in simulated environments. |
| Hardware Specification | No | The paper mentions training 'on a Turtle Bot robot' and that 'The Turtle Bot is a mobile base with two degrees of freedom that senses the world from a Kinect camera.' These are general product names, but specific hardware details like CPU, GPU models, or memory specifications are not provided. |
| Software Dependencies | No | The paper describes the algorithms and their parameters but does not list specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x, specific libraries). |
| Experiment Setup | Yes | For COACH parameters, we used a softmax parameterized policy, where each action preference value was a linear function of the image features, plus tanh(θa), where θa is a learnable parameter for action a, providing a preference in the absence of any stimulus. We used two eligibility traces with λ = 0.95 for feedback +1 and -1, and λ = 0.9999 for feedback +4. The feedback-action delay d was set to 6, which is 0.198 seconds. For Q-learning, we used discount factor γ = 0.99 and learning rate α = 0.2. For TAMER, we used α = 0.2. For COACH in the grid world, we used β = 1 and α = 0.05. (A hedged code sketch of these settings follows the table.) |
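The pseudocode row (Algorithm 1, Real-time COACH) and the experiment-setup row above describe the method concretely enough to sketch how the pieces fit together: a softmax policy over action preferences that are linear in the features plus tanh(θa), two eligibility traces (λ = 0.95 for ±1 feedback, λ = 0.9999 for +4), and a feedback-action delay of d = 6 steps. The Python sketch below is an illustration assembled from those quoted details under our own assumptions; class names, the feature representation, and the update bookkeeping are hypothetical and it is not the authors' released implementation.

```python
import numpy as np
from collections import deque

# Illustrative constants taken from the table above (assumed, not verified against code).
ALPHA = 0.05                 # COACH learning rate reported for the grid world
LAMBDAS = [0.95, 0.9999]     # trace decays: +/-1 feedback vs. +4 feedback
DELAY = 6                    # feedback-action delay d (~0.198 s at the reported rate)


class RealTimeCOACHSketch:
    """Hedged sketch of a real-time COACH-style learner with a softmax policy."""

    def __init__(self, n_features, n_actions):
        self.w = np.zeros((n_actions, n_features))   # linear weights per action
        self.theta = np.zeros(n_actions)             # stimulus-free preference parameters
        self.traces = [{"lam": lam,
                        "e_w": np.zeros((n_actions, n_features)),
                        "e_theta": np.zeros(n_actions)} for lam in LAMBDAS]
        self.history = deque(maxlen=DELAY + 1)       # recent (features, action) pairs

    def preferences(self, x):
        # Action preference = linear function of features + tanh(theta_a).
        return self.w @ x + np.tanh(self.theta)

    def policy(self, x, beta=1.0):
        # Softmax over preferences (beta = 1 as reported for the grid world).
        z = beta * self.preferences(x)
        z -= z.max()                                 # numerical stability
        p = np.exp(z)
        return p / p.sum()

    def act(self, x):
        pi = self.policy(x)
        a = np.random.choice(len(pi), p=pi)
        self.history.append((x, a))
        return a

    def step(self, feedback):
        # Accumulate grad log pi for the action taken DELAY steps ago into every
        # trace, then apply any human feedback using the trace whose lambda
        # matches the feedback magnitude.
        if len(self.history) <= DELAY:
            return
        x, a = self.history[0]                       # delayed state-action pair
        pi = self.policy(x)
        grad_pref = -pi                              # d log pi / d preference
        grad_pref[a] += 1.0
        for tr in self.traces:
            tr["e_w"] = tr["lam"] * tr["e_w"] + np.outer(grad_pref, x)
            tr["e_theta"] = (tr["lam"] * tr["e_theta"]
                             + grad_pref * (1.0 - np.tanh(self.theta) ** 2))
        if feedback != 0:
            tr = self.traces[1] if abs(feedback) >= 4 else self.traces[0]
            self.w += ALPHA * feedback * tr["e_w"]
            self.theta += ALPHA * feedback * tr["e_theta"]
```

In this reading, the human feedback value plays the role of the advantage in a policy-gradient update, and the long-decay trace lets a rare +4 signal credit behavior further back in time; both points are inferred from the quoted parameter choices rather than stated code, so treat the sketch as a study aid, not a reproduction artifact.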