Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
A Strategy-Aware Technique for Learning Behaviors from Discrete Human Feedback
Authors: Robert Loftin, James MacGlashan, Bei Peng, Matthew Taylor, Michael Littman, Jeff Huang, David Roberts
AAAI 2014
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Results from user studies show that humans use a variety of training strategies in practice and both algorithms can learn a contextual bandit task faster than algorithms that treat the feedback as numeric. Simulated trainers are also employed to evaluate the algorithms in both contextual bandit and sequential decision-making tasks with similar results. |
| Researcher Affiliation | Academia | Robert Loftin North Carolina State University EMAIL James MacGlashan Brown University EMAIL Bei Peng Washington State University EMAIL Matthew E. Taylor Washington State University EMAIL Michael L. Littman Brown University EMAIL Jeff Huang Brown University EMAIL David L. Roberts North Carolina State University EMAIL |
| Pseudocode | Yes | Algorithm 1 The SABL algorithm. ... Algorithm 2 The I-SABL algorithm. |
| Open Source Code | No | The paper does not provide any explicit statements about the release of source code or links to a code repository for the methodology described. |
| Open Datasets | No | The paper describes user studies and simulated trainer experiments but does not refer to any publicly available datasets with concrete access information (e.g., specific links, DOIs, or citations to external public datasets). The user study involved human participants interacting with a described task. |
| Dataset Splits | No | The paper mentions performance criteria (50%, 75%, 100% correctness) for the learning agents but does not specify any dataset splits (e.g., train/validation/test percentages or counts) for reproducibility. |
| Hardware Specification | No | The paper describes the experimental setup and tasks but does not provide any specific details about the hardware (e.g., CPU, GPU models, memory) used to run the experiments. |
| Software Dependencies | No | The paper describes the algorithms and their implementation conceptually but does not list any specific software dependencies with version numbers. |
| Experiment Setup | Yes | To evaluate their performance when learning from human trainers, we ran an online study in which participants trained learning agents using either SABL (with µ+ = µ− = 0.1), I-SABL, M−0, or M+0... We tested each learning agent on tasks consisting of 2, 5, 10, 15 and 20 observations and 2, 3, or 4 actions. ... The trainer's error rate ϵ = 0.2, matching SABL and I-SABL's assumed value. ... Trainer strategies were defined by {µ+, µ−} = {0.1, 0.1} for the balanced feedback strategy, {µ+, µ−} = {0.1, 0.9} for the reward-focused strategy, and {µ+, µ−} = {0.9, 0.1} for the punishment-focused strategy. ... For all strategies, ϵ = 0.05. |
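The experiment setup above parameterizes SABL's trainer model with µ+ (probability of withholding reward after a correct action), µ− (probability of withholding punishment after an incorrect action), and an error rate ϵ. As a rough illustration of how those parameters enter a Bayesian update, here is a minimal sketch of a SABL-style contextual-bandit learner; the likelihood terms follow the paper's trainer model, but the class and function names are mine and the independent per-observation posterior is a simplifying assumption, not the authors' implementation.

```python
def feedback_likelihood(f, consistent, mu_plus, mu_minus, eps):
    """P(feedback | whether the action matches the hypothesized target).

    f is +1 (reward), -1 (punishment), or 0 (no feedback).
    eps is the trainer's error rate; mu_plus / mu_minus are the
    probabilities of withholding reward / punishment.
    """
    if consistent:
        if f == +1:
            return (1 - eps) * (1 - mu_plus)      # intended reward, delivered
        if f == -1:
            return eps * (1 - mu_minus)           # erroneous punishment
        return (1 - eps) * mu_plus + eps * mu_minus  # feedback withheld
    else:
        if f == +1:
            return eps * (1 - mu_plus)            # erroneous reward
        if f == -1:
            return (1 - eps) * (1 - mu_minus)     # intended punishment
        return eps * mu_plus + (1 - eps) * mu_minus  # feedback withheld


class SABLBandit:
    """Contextual bandit with an independent posterior over the correct
    action for each observation, updated from discrete feedback."""

    def __init__(self, n_obs, n_actions, mu_plus=0.1, mu_minus=0.1, eps=0.2):
        self.params = (mu_plus, mu_minus, eps)
        self.post = [[1.0 / n_actions] * n_actions for _ in range(n_obs)]

    def act(self, obs):
        # Greedy action under the current posterior.
        p = self.post[obs]
        return max(range(len(p)), key=p.__getitem__)

    def update(self, obs, action, f):
        # Bayes rule: weight each candidate target action by the
        # likelihood of the observed feedback, then renormalize.
        mu_p, mu_m, eps = self.params
        p = self.post[obs]
        for a in range(len(p)):
            p[a] *= feedback_likelihood(f, a == action, mu_p, mu_m, eps)
        z = sum(p)
        self.post[obs] = [x / z for x in p]
```

Note that the three likelihoods sum to one for each consistency case, so `update` is a proper Bayesian step; a reward-focused trainer (µ− near 0.9) makes the absence of punishment only weakly informative, which is the asymmetry the strategy-aware model is designed to exploit.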