Where to Add Actions in Human-in-the-Loop Reinforcement Learning
Authors: Travis Mandel, Yun-En Liu, Emma Brunskill, Zoran Popović
AAAI 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate ELI on a variety of simulated domains adapted from the literature, including domains with over a million actions and domains where the simulated experts change over time. We find ELI demonstrates excellent empirical performance, even in settings where the synthetic experts are quite poor. |
| Researcher Affiliation | Collaboration | 1Center for Game Science, Computer Science & Engineering, University of Washington, Seattle, WA 2Enlearn, Seattle, WA 3School of Computer Science, Carnegie Mellon University, Pittsburgh, PA |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | We examine performance on three environments adapted from the literature. Riverswim is a chain MDP with 6 states, 5 outcomes, and 2 ground actions per state that requires efficient exploration (Osband, Russo, and Van Roy 2013). Marblemaze (Asmuth et al. 2009; Russell and Norvig 1994) is a gridworld MDP with 36 states, 5 outcomes, and 4 ground actions per state. We use a modified version of the Large Action Task (Sallans and Hinton 2004). |
| Dataset Splits | No | The paper describes using simulated environments (Riverswim, Marblemaze, Large Action Task) and discusses running the agent within these environments, but it does not provide specific details on training, validation, or test dataset splits. |
| Hardware Specification | No | The paper states 'We show results using both simulated environments and simulated humans', but it does not provide any specific details about the hardware (e.g., CPU, GPU models, memory) used to run these simulations or experiments. |
| Software Dependencies | No | The paper mentions using Posterior Sampling Reinforcement Learning (PSRL) and other algorithmic approaches like UCRL2, MBIE, Thompson Sampling, and BOSS, but it does not provide specific software dependencies or their version numbers (e.g., programming language versions, library versions). |
| Experiment Setup | Yes | Each state starts with a single action... and every 20 episodes a new action is added at the state the agent selects. We transform standard domains to this new setting by sampling a new (or initial) action uniformly at random... We use Posterior Sampling Reinforcement Learning (PSRL)... We constrain all methods to only select states the agent has visited before. We set δ to 0.05... We set J = 10... For the prior we let αm = βm = 1 for all m... |
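The "Open Datasets" row quotes the paper's description of Riverswim as a 6-state chain MDP with 2 ground actions per state. As a concrete reference point, the sketch below implements a RiverSwim-style environment; the transition probabilities, rewards, and class interface are illustrative assumptions drawn from common parameterizations in the literature, not the paper's modified domains or its "outcome" abstraction.

```python
"""Minimal RiverSwim-style chain MDP sketch (assumed parameterization)."""
import numpy as np

class RiverSwim:
    LEFT, RIGHT = 0, 1

    def __init__(self, n_states=6, seed=0):
        self.n = n_states
        self.rng = np.random.default_rng(seed)
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        s = self.state
        if action == self.LEFT:
            # Swimming with the current: deterministic move left.
            next_s = max(s - 1, 0)
        else:
            # Swimming against the current: usually stay, sometimes advance,
            # occasionally get pushed back (illustrative probabilities).
            u = self.rng.random()
            if u < 0.3:
                next_s = min(s + 1, self.n - 1)
            elif u < 0.9:
                next_s = s
            else:
                next_s = max(s - 1, 0)
        # Small reward at the leftmost state, large reward at the rightmost.
        if next_s == 0 and action == self.LEFT:
            reward = 0.005
        elif next_s == self.n - 1 and action == self.RIGHT:
            reward = 1.0
        else:
            reward = 0.0
        self.state = next_s
        return next_s, reward

# Usage: roll out a random policy for one short episode.
env = RiverSwim()
s = env.reset()
for _ in range(20):
    a = int(env.rng.integers(2))
    s, r = env.step(a)
```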
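The "Experiment Setup" row quotes the protocol of starting each state with a single action, adding a new action every 20 episodes at an agent-selected state, and using PSRL with a Beta(1, 1) prior, δ = 0.05, and J = 10. The sketch below illustrates only the shape of that loop under stated assumptions: a toy tabular stand-in environment, Bernoulli rewards with Beta(1, 1) posteriors, Thompson-style sampling in place of full PSRL planning, and a placeholder rule for choosing where to add actions (the paper's ELI criterion is not implemented here).

```python
"""Sketch of the quoted experiment loop; all details are illustrative assumptions."""
import numpy as np

rng = np.random.default_rng(0)

N_STATES = 6          # small chain, loosely matching RiverSwim's 6 states
EPISODE_LEN = 20      # horizon per episode (assumed, not given in the excerpt)
N_EPISODES = 200
ADD_EVERY = 20        # "every 20 episodes a new action is added"
DELTA = 0.05          # delta from the quoted setup (role not shown here)
J = 10                # J from the quoted setup (role not shown here)

# Per (state, action) Beta(1, 1) posterior over a Bernoulli reward signal.
# Each state starts with a single action, as in the quoted setup.
alpha = [[1.0] for _ in range(N_STATES)]
beta = [[1.0] for _ in range(N_STATES)]
visited = {0}  # methods are constrained to states the agent has visited

def true_reward_prob(s, a):
    """Hypothetical ground-truth Bernoulli parameter, for illustration only."""
    return 0.2 + 0.6 * ((s + a) % 3) / 2.0

def sample_and_act(s):
    """Thompson-style step: sample each action's mean from its Beta posterior
    and act greedily on the samples (a one-step stand-in for PSRL planning)."""
    samples = [rng.beta(alpha[s][a], beta[s][a]) for a in range(len(alpha[s]))]
    return int(np.argmax(samples))

for ep in range(N_EPISODES):
    if ep > 0 and ep % ADD_EVERY == 0:
        # Add a new action at a visited state; picking the visited state with
        # the fewest actions is purely a placeholder selection rule, not ELI.
        s_new = min(visited, key=lambda s: len(alpha[s]))
        alpha[s_new].append(1.0)
        beta[s_new].append(1.0)

    s = 0  # episodes start in state 0 (assumption)
    for _ in range(EPISODE_LEN):
        a = sample_and_act(s)
        r = rng.random() < true_reward_prob(s, a)
        alpha[s][a] += r
        beta[s][a] += 1 - r
        s = (s + 1) % N_STATES if a % 2 else max(s - 1, 0)  # toy dynamics
        visited.add(s)
```

The constants DELTA and J are carried over from the quoted setup only to document their values; how the paper uses them (e.g., in confidence bounds or sampling counts) is not reconstructed here.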