Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Efficient Bayesian Clustering for Reinforcement Learning
Authors: Travis Mandel, Yun-En Liu, Emma Brunskill, Zoran Popovic
IJCAI 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | TCRL-Theoretic achieves near-optimal Bayesian regret bounds while consistently improving over a standard Bayesian exploration approach. TCRL-Relaxed is guaranteed to converge to acting optimally, and empirically outperforms state-of-the-art Bayesian clustering algorithms across a variety of simulated domains, even in cases where no states are similar. |
| Researcher Affiliation | Collaboration | Travis Mandel (1), Yun-En Liu (2), Emma Brunskill (3), and Zoran Popović (1, 2). (1) Center for Game Science, Computer Science & Engineering, University of Washington, Seattle, WA; (2) Enlearn™, Seattle, WA; (3) School of Computer Science, Carnegie Mellon University, Pittsburgh, PA |
| Pseudocode | Yes | Algorithm 1 TCRL-Theoretic; Algorithm 2 TCRL-Relaxed; Algorithm 3 Subroutine for TCRL-Relaxed; Algorithm 4 Subroutine for TCRL-Theoretic |
| Open Source Code | No | The paper does not provide any concrete statements or links regarding the availability of its source code. |
| Open Datasets | Yes | Riverswim [Strehl and Littman, 2008]; Marble Maze [Asmuth et al., 2009; Russell et al., 1994]; Six Arms [Strehl and Littman, 2008]; 200-state gridworld featuring one-dimensional walls [Johns and Mahadevan, 2007] |
| Dataset Splits | No | The paper conducts experiments in reinforcement learning environments where agents interact over episodes rather than using static datasets with explicit train/validation/test splits. Therefore, the concept of fixed dataset splits for validation is not directly applicable or described. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments (e.g., CPU, GPU models, memory specifications). |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies or libraries used in their implementation or experiments. |
| Experiment Setup | Yes | We used a horizon of 20 and defined 5 relative outcomes for moving left and right or staying with some reward. (Riverswim); We used a horizon of 30 and a set of 5 outcomes denoting whether the agent moved in each cardinal direction or hit a wall (Marble Maze); We used a horizon of 10 and 14 relative observations (Six Arms); We chose the reward to be 100 for reaching the goal and -1 for each step taken, and since the problem was harder we used a longer horizon of 50 and averaged over 20 (instead of 100) runs. (200-state environment); We choose the value of 0.5 recommended by Asmuth et al. 2009. (CRP concentration parameter for MCMC); We used 500 iterations in Riverswim (recommended for Chain, a similar environment), and 100 as recommended for Marble Maze. (MCMC iterations) |
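
The settings quoted in the Experiment Setup row can be gathered into one configuration, which may help anyone attempting a re-implementation. The following is a minimal sketch, not the authors' code: the class name `DomainSetup`, the field names, and the dictionary keys are hypothetical. The figure of 100 runs for the first three domains is inferred from the phrase "20 (instead of 100) runs", the 14 "relative observations" for Six Arms are treated as its outcome count, and values the row does not state are left as `None`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DomainSetup:
    """Hypothetical container for the per-domain settings quoted in the table."""
    horizon: int                           # episode horizon
    num_outcomes: Optional[int]            # relative outcomes/observations; None if not stated
    num_runs: int                          # runs averaged over (100 is inferred, see lead-in)
    mcmc_iterations: Optional[int] = None  # MCMC baseline iterations; None if not stated
    goal_reward: Optional[float] = None    # only stated for the 200-state gridworld
    step_reward: Optional[float] = None    # only stated for the 200-state gridworld


# CRP concentration parameter used for the MCMC baseline (quoted in the table).
CRP_CONCENTRATION = 0.5

# Keys are illustrative labels, not identifiers from the paper.
SETUPS = {
    "riverswim": DomainSetup(horizon=20, num_outcomes=5, num_runs=100,
                             mcmc_iterations=500),
    "marble_maze": DomainSetup(horizon=30, num_outcomes=5, num_runs=100,
                               mcmc_iterations=100),
    "six_arms": DomainSetup(horizon=10, num_outcomes=14, num_runs=100),
    "gridworld_200": DomainSetup(horizon=50, num_outcomes=None, num_runs=20,
                                 goal_reward=100.0, step_reward=-1.0),
}


def unspecified_fields(name: str) -> list:
    """Return the names of settings the table leaves unstated for a domain."""
    setup = SETUPS[name]
    return [field for field, value in vars(setup).items() if value is None]
```

Calling `unspecified_fields("six_arms")` lists the settings that would still have to be recovered from the paper itself before a run could be reproduced.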