A Strategy-Aware Technique for Learning Behaviors from Discrete Human Feedback

Authors: Robert Loftin, James MacGlashan, Bei Peng, Matthew Taylor, Michael Littman, Jeff Huang, David Roberts

AAAI 2014 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Results from user studies show that humans use a variety of training strategies in practice and both algorithms can learn a contextual bandit task faster than algorithms that treat the feedback as numeric. Simulated trainers are also employed to evaluate the algorithms in both contextual bandit and sequential decision-making tasks with similar results.
Researcher Affiliation | Academia | Robert Loftin, North Carolina State University (rtloftin@ncsu.edu); James MacGlashan, Brown University (james_macglashan@brown.edu); Bei Peng, Washington State University (bei.peng@wsu.edu); Matthew E. Taylor, Washington State University (taylorm@eecs.wsu.edu); Michael L. Littman, Brown University (mlittman@cs.brown.edu); Jeff Huang, Brown University (jeff@cs.brown.edu); David L. Roberts, North Carolina State University (robertsd@csc.ncsu.edu)
Pseudocode | Yes | Algorithm 1: The SABL algorithm. ... Algorithm 2: The I-SABL algorithm.
Open Source Code | No | The paper does not provide any explicit statements about the release of source code or links to a code repository for the methodology described.
Open Datasets | No | The paper describes user studies and simulated trainer experiments but does not refer to any publicly available datasets with concrete access information (e.g., specific links, DOIs, or citations to external public datasets). The user study involved human participants interacting with a described task.
Dataset Splits | No | The paper mentions performance criteria (50%, 75%, 100% correctness) for the learning agents but does not specify any dataset splits (e.g., train/validation/test percentages or counts) for reproducibility.
Hardware Specification | No | The paper describes the experimental setup and tasks but does not provide any specific details about the hardware (e.g., CPU, GPU models, memory) used to run the experiments.
Software Dependencies | No | The paper describes the algorithms and their implementation conceptually but does not list any specific software dependencies with version numbers.
Experiment Setup | Yes | To evaluate their performance when learning from human trainers, we ran an online study in which participants trained learning agents using either SABL (with µ+ = µ− = 0.1), I-SABL, M^0, or M^+0... We tested each learning agent on tasks consisting of 2, 5, 10, 15 and 20 observations and 2, 3, or 4 actions. ... The trainer's error rate ϵ = 0.2, matching SABL and I-SABL's assumed value. ... Trainer strategies were defined by {µ+, µ−} = {0.1, 0.1} for the balanced feedback strategy, {µ+, µ−} = {0.1, 0.9} for the reward-focused strategy, and {µ+, µ−} = {0.9, 0.1} for the punishment-focused strategy. ... For all strategies, ϵ = 0.05.
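
To make the quoted setup concrete, below is a minimal Python sketch (not the authors' released code, which is unavailable) of a simulated trainer and a SABL-style maximum-likelihood learner for a contextual bandit. It assumes the intended-feedback-then-withhold reading of the strategy parameters described above: the trainer intends to reward correct actions and punish incorrect ones, errs with probability ϵ, and withholds an intended reward or punishment with probability µ+ or µ−, respectively. All class and function names are illustrative.

```python
import math
import random
from collections import defaultdict


def simulate_feedback(correct, mu_plus, mu_minus, epsilon, rng=random):
    """Sample the trainer's response: '+', '-', or '0' (no feedback).

    epsilon  : trainer error rate (0.2 for the assumed-value comparison,
               0.05 for the simulated strategies quoted above)
    mu_plus  : probability of withholding an intended reward
    mu_minus : probability of withholding an intended punishment
    """
    if correct:
        intended = '+' if rng.random() > epsilon else '-'
    else:
        intended = '-' if rng.random() > epsilon else '+'
    if intended == '+':
        return '0' if rng.random() < mu_plus else '+'
    return '0' if rng.random() < mu_minus else '-'


def feedback_likelihood(feedback, correct, mu_plus, mu_minus, epsilon):
    """p(feedback | the taken action was correct / incorrect) under the same model."""
    p_intend_reward = (1 - epsilon) if correct else epsilon
    p_intend_punish = 1 - p_intend_reward
    if feedback == '+':
        return p_intend_reward * (1 - mu_plus)
    if feedback == '-':
        return p_intend_punish * (1 - mu_minus)
    return p_intend_reward * mu_plus + p_intend_punish * mu_minus


class StrategyAwareBandit:
    """SABL-style learner: for each observation, track the log-likelihood that
    each action is the one the trainer wants, and act greedily on that estimate."""

    def __init__(self, n_actions, mu_plus=0.1, mu_minus=0.1, epsilon=0.2):
        self.n_actions = n_actions
        self.mu_plus, self.mu_minus, self.epsilon = mu_plus, mu_minus, epsilon
        self.loglik = defaultdict(lambda: [0.0] * n_actions)

    def act(self, obs, rng=random):
        ll = self.loglik[obs]
        best = max(ll)
        return rng.choice([a for a, v in enumerate(ll) if v == best])

    def update(self, obs, action, feedback):
        ll = self.loglik[obs]
        for a in range(self.n_actions):
            # Under the hypothesis "a is the target action for obs", the action
            # just taken was correct iff it equals a.
            ll[a] += math.log(feedback_likelihood(
                feedback, a == action, self.mu_plus, self.mu_minus, self.epsilon))


if __name__ == "__main__":
    # Balanced-feedback simulated trainer: {mu+, mu-} = {0.1, 0.1}, epsilon = 0.05.
    target = {obs: obs % 3 for obs in range(5)}   # hypothetical target policy
    agent = StrategyAwareBandit(n_actions=3, mu_plus=0.1, mu_minus=0.1, epsilon=0.05)
    for _ in range(500):
        obs = random.randrange(5)
        action = agent.act(obs)
        fb = simulate_feedback(action == target[obs], 0.1, 0.1, 0.05)
        agent.update(obs, action, fb)
    print(sum(agent.act(o) == target[o] for o in target), "of", len(target), "observations learned")
```

The reward-focused and punishment-focused strategies from the setup can be simulated by passing {µ+, µ−} = {0.1, 0.9} or {0.9, 0.1} to simulate_feedback; the learner above assumes the strategy parameters are known (as SABL does), whereas I-SABL would additionally infer them from the feedback history.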