A Strategy-Aware Technique for Learning Behaviors from Discrete Human Feedback
Authors: Robert Loftin, James MacGlashan, Bei Peng, Matthew Taylor, Michael Littman, Jeff Huang, David Roberts
AAAI 2014 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Results from user studies show that humans use a variety of training strategies in practice and that both algorithms can learn a contextual bandit task faster than algorithms that treat the feedback as numeric. Simulated trainers are also employed to evaluate the algorithms in both contextual bandit and sequential decision-making tasks, with similar results. |
| Researcher Affiliation | Academia | Robert Loftin, North Carolina State University, rtloftin@ncsu.edu; James MacGlashan, Brown University, james_macglashan@brown.edu; Bei Peng, Washington State University, bei.peng@wsu.edu; Matthew E. Taylor, Washington State University, taylorm@eecs.wsu.edu; Michael L. Littman, Brown University, mlittman@cs.brown.edu; Jeff Huang, Brown University, jeff@cs.brown.edu; David L. Roberts, North Carolina State University, robertsd@csc.ncsu.edu |
| Pseudocode | Yes | Algorithm 1 The SABL algorithm. ... Algorithm 2 The I-SABL algorithm. |
| Open Source Code | No | The paper does not provide any explicit statements about the release of source code or links to a code repository for the methodology described. |
| Open Datasets | No | The paper describes user studies and simulated trainer experiments but does not refer to any publicly available datasets with concrete access information (e.g., specific links, DOIs, or citations to external public datasets). The user study involved human participants interacting with a described task. |
| Dataset Splits | No | The paper mentions performance criteria (50%, 75%, 100% correctness) for the learning agents but does not specify any dataset splits (e.g., train/validation/test percentages or counts) for reproducibility. |
| Hardware Specification | No | The paper describes the experimental setup and tasks but does not provide any specific details about the hardware (e.g., CPU, GPU models, memory) used to run the experiments. |
| Software Dependencies | No | The paper describes the algorithms and their implementation conceptually but does not list any specific software dependencies with version numbers. |
| Experiment Setup | Yes | To evaluate their performance when learning from human trainers, we ran an online study in which participants trained learning agents using either SABL (with µ+ = µ− = 0.1), I-SABL, M−0, or M+0... We tested each learning agent on tasks consisting of 2, 5, 10, 15, and 20 observations and 2, 3, or 4 actions. ... The trainer's error rate ϵ = 0.2, matching SABL and I-SABL's assumed value. ... Trainer strategies were defined by {µ+, µ−} = {0.1, 0.1} for the balanced feedback strategy, {µ+, µ−} = {0.1, 0.9} for the reward-focused strategy, and {µ+, µ−} = {0.9, 0.1} for the punishment-focused strategy. ... For all strategies, ϵ = 0.05. (A simulated-trainer sketch based on these parameters follows the table.) |
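
The checklist above quotes the strategy parameters (µ+, µ−, ϵ) used for the paper's simulated trainers. As a rough illustration of how such parameters could plug into a SABL-style Bayesian learner for the contextual-bandit task, here is a minimal Python sketch. The feedback model is an assumption reconstructed from the paper's description (the trainer errs with probability ϵ, withholds reward with probability µ+ for correct actions, and withholds punishment with probability µ− for incorrect actions); the function names and example task are illustrative only, not the authors' code.

```python
import itertools
import random

def feedback_probs(correct, mu_plus, mu_minus, epsilon):
    """Return (p_reward, p_punish, p_none) under the assumed feedback model.

    Assumption: the trainer intends reward for correct actions and punishment
    for incorrect ones, errs with probability epsilon, and withholds explicit
    reward/punishment with probability mu_plus / mu_minus respectively.
    """
    if correct:
        p_reward = (1 - epsilon) * (1 - mu_plus)
        p_punish = epsilon * (1 - mu_minus)
    else:
        p_reward = epsilon * (1 - mu_plus)
        p_punish = (1 - epsilon) * (1 - mu_minus)
    return p_reward, p_punish, 1.0 - p_reward - p_punish

def simulated_trainer(correct, mu_plus, mu_minus, epsilon, rng=random):
    """Sample 'reward', 'punish', or 'none' for one agent action."""
    p_reward, p_punish, _ = feedback_probs(correct, mu_plus, mu_minus, epsilon)
    u = rng.random()
    if u < p_reward:
        return "reward"
    if u < p_reward + p_punish:
        return "punish"
    return "none"

def sabl_update(posterior, obs, action, feedback, mu_plus, mu_minus, epsilon):
    """One Bayesian update over candidate policies (tuples: observation -> action).

    The likelihood of the received feedback depends only on whether the taken
    action matches the candidate policy's action for this observation.
    """
    idx = {"reward": 0, "punish": 1, "none": 2}[feedback]
    scores = {}
    for policy, p in posterior.items():
        correct = (policy[obs] == action)
        scores[policy] = p * feedback_probs(correct, mu_plus, mu_minus, epsilon)[idx]
    total = sum(scores.values())
    return {pol: s / total for pol, s in scores.items()}

# Example: a balanced-feedback simulated trainer ({mu+, mu-} = {0.1, 0.1},
# epsilon = 0.05) on a hypothetical 2-observation, 2-action contextual bandit.
observations, actions = range(2), range(2)
policies = list(itertools.product(actions, repeat=len(observations)))
posterior = {pol: 1.0 / len(policies) for pol in policies}
target = (0, 1)  # illustrative target policy
for _ in range(50):
    obs = random.choice(list(observations))
    # Act greedily with respect to the current posterior over policies.
    act = max(actions, key=lambda a: sum(p for pol, p in posterior.items()
                                         if pol[obs] == a))
    fb = simulated_trainer(act == target[obs], 0.1, 0.1, 0.05)
    posterior = sabl_update(posterior, obs, act, fb, 0.1, 0.1, 0.05)
print(max(posterior, key=posterior.get))  # most likely policy after training
```

The full I-SABL algorithm goes further by inferring the trainer's strategy parameters rather than assuming them, and the M−0 and M+0 baselines instead treat the discrete feedback as numeric values; both of those variants are omitted from this sketch.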