Offline Evaluation of Online Reinforcement Learning Algorithms
Authors: Travis Mandel, Yun-En Liu, Emma Brunskill, Zoran Popović
AAAI 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments, including those that use data from a real educational domain, show these methods have different tradeoffs. |
| Researcher Affiliation | Collaboration | Center for Game Science, Computer Science & Engineering, University of Washington, Seattle, WA; Enlearn™, Seattle, WA; School of Computer Science, Carnegie Mellon University, Pittsburgh, PA |
| Pseudocode | Yes | Algorithm 1 Queue-based Evaluator; Algorithm 2 Per-State Rejection Sampling Evaluator; Algorithm 3 Per-Episode Rejection Sampling Evaluator |
| Open Source Code | Yes | For details see the appendix (available at http://grail.cs.washington.edu/projects/nonstationaryeval). |
| Open Datasets | Yes | We collected a dataset of 11,550 players from a child-focused educational website, using a semi-uniform sampling policy. [Also mentions] Six Arms (Strehl and Littman 2004). |
| Dataset Splits | No | The paper mentions using a dataset for evaluation but does not specify explicit training, validation, or test splits by percentages or sample counts for reproducibility. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU, GPU models, or memory specifications) used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'Posterior Sampling Reinforcement Learning (PSRL)' but does not provide specific version numbers for any software or libraries used in the experiments. |
| Experiment Setup | Yes | Here, we show results evaluating Posterior Sampling Reinforcement Learning (PSRL) ... The standard version of PSRL creates one deterministic policy each episode based on a single posterior sample; however, we can sample the posterior multiple times to create multiple policies and randomly choose between them at each step, which allows us to test our evaluators with more or less revealed randomness. ... PSRL run with 10 posterior samples. |
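The queue-based evaluator named in the pseudocode row (Algorithm 1) can be sketched as follows. This is a hypothetical illustration, not the authors' code: it assumes logged transitions of the form `(state, action, reward, next_state)` and an agent object exposing `act` and `observe` methods (both names are assumptions). Logged samples for each `(state, action)` pair are held in queues, and evaluation advances only while a logged sample matches the agent's chosen action.

```python
from collections import defaultdict, deque


def queue_based_evaluate(transitions, agent, start_state, max_steps=1000):
    """Replay logged transitions against a learning agent.

    The agent only advances while a logged sample matches the
    (state, action) it chose; when no matching sample remains,
    evaluation stops rather than fabricating an outcome.
    """
    # Bucket logged samples by the (state, action) pair they came from.
    queues = defaultdict(deque)
    for (s, a, r, s_next) in transitions:
        queues[(s, a)].append((r, s_next))

    s, total_reward = start_state, 0.0
    for _ in range(max_steps):
        a = agent.act(s)
        if not queues[(s, a)]:
            break  # no matching logged sample; cannot continue
        r, s_next = queues[(s, a)].popleft()
        agent.observe(s, a, r, s_next)  # let the agent learn online
        total_reward += r
        s = s_next
    return total_reward
```

The key design point is that the agent interacts with logged data as if it were a live environment, so online-learning behavior (updating after every step) is preserved.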
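The per-episode rejection sampling evaluator (Algorithm 3) can likewise be sketched. This is a minimal, hypothetical version: it assumes the behavior policy's action probabilities `pi_b(s, a)` are known for every logged step, and that `M` upper-bounds the per-episode likelihood ratio so the acceptance probability stays in [0, 1].

```python
import random


def per_episode_rejection_sample(episodes, pi_e, pi_b, M, rng=None):
    """Accept each logged episode with probability
    prod_t pi_e(a_t | s_t) / (M * prod_t pi_b(a_t | s_t)),
    so that accepted episodes are distributed as if they had been
    generated by running the evaluation policy pi_e.
    """
    rng = rng or random.Random(0)
    accepted = []
    for ep in episodes:  # ep: list of (state, action, reward) steps
        ratio = 1.0
        for (s, a, _r) in ep:
            ratio *= pi_e(s, a) / pi_b(s, a)
        if rng.random() < ratio / M:
            accepted.append(ep)
    return accepted
```

A usage note: with a uniform two-action behavior policy and an evaluation policy that always plays action 0, episodes containing action 0 have ratio 2 and are always kept (with M = 2), while episodes containing action 1 have ratio 0 and are always rejected.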
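The experiment-setup row describes running PSRL with multiple posterior samples so the resulting policy reveals more randomness to the evaluators. A hypothetical sketch of that idea, using a Bernoulli bandit with Beta posteriors purely for illustration (the paper's domains are episodic MDPs, and all names here are assumptions):

```python
import random


def psrl_mixture_policy(successes, failures, k, rng=None):
    """Sample the posterior k times; each sample yields one greedy arm.

    Acting then means choosing uniformly among these k greedy arms at
    each step. With k = 1 this is standard (deterministic-per-episode)
    posterior sampling; larger k exposes more policy randomness.
    """
    rng = rng or random.Random(0)
    greedy_arms = []
    for _ in range(k):
        # One posterior sample per arm: Beta(successes + 1, failures + 1).
        sampled_means = [rng.betavariate(s + 1, f + 1)
                         for s, f in zip(successes, failures)]
        greedy_arms.append(max(range(len(sampled_means)),
                               key=lambda i: sampled_means[i]))
    # The returned policy picks one of the k greedy arms uniformly.
    return lambda: rng.choice(greedy_arms)
```

The design trade-off this illustrates: a stochastic mixture over posterior samples gives rejection-sampling evaluators more chances to match logged actions, at the cost of acting less greedily with respect to any single posterior draw.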