Efficient Bayesian Clustering for Reinforcement Learning
Authors: Travis Mandel, Yun-En Liu, Emma Brunskill, Zoran Popović
IJCAI 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | TCRL-Theoretic achieves near-optimal Bayesian regret bounds while consistently improving over a standard Bayesian exploration approach. TCRL-Relaxed is guaranteed to converge to acting optimally, and empirically outperforms state-of-the-art Bayesian clustering algorithms across a variety of simulated domains, even in cases where no states are similar. |
| Researcher Affiliation | Collaboration | Travis Mandel¹, Yun-En Liu², Emma Brunskill³, and Zoran Popović¹,². ¹Center for Game Science, Computer Science & Engineering, University of Washington, Seattle, WA; ²Enlearn™, Seattle, WA; ³School of Computer Science, Carnegie Mellon University, Pittsburgh, PA |
| Pseudocode | Yes | Algorithm 1 TCRL-Theoretic; Algorithm 2 TCRL-Relaxed; Algorithm 3 Subroutine for TCRL-Relaxed; Algorithm 4 Subroutine for TCRL-Theoretic |
| Open Source Code | No | The paper does not provide any concrete statements or links regarding the availability of its source code. |
| Open Datasets | Yes | Riverswim [Strehl and Littman, 2008]; Marble Maze [Asmuth et al., 2009; Russell et al., 1994]; Six Arms [Strehl and Littman, 2008]; 200-state gridworld featuring one-dimensional walls [Johns and Mahadevan, 2007] |
| Dataset Splits | No | The experiments take place in reinforcement learning environments where agents interact over episodes; there are no static datasets, so explicit train/validation/test splits are neither applicable nor described. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments (e.g., CPU, GPU models, memory specifications). |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies or libraries used in their implementation or experiments. |
| Experiment Setup | Yes | We used a horizon of 20 and defined 5 relative outcomes for moving left and right or staying with some reward. (Riverswim); We used a horizon of 30 and a set of 5 outcomes denoting whether the agent moved in each cardinal direction or hit a wall. (Marble Maze); We used a horizon of 10 and 14 relative observations. (Six Arms); We chose the reward to be 100 for reaching the goal and -1 for each step taken, and since the problem was harder we used a longer horizon of 50 and averaged over 20 (instead of 100) runs. (200-state environment); We chose the value of 0.5 recommended by Asmuth et al. [2009]. (CRP concentration parameter for MCMC); We used 500 iterations in Riverswim (recommended for Chain, a similar environment), and 100 as recommended for Marble Maze. (MCMC iterations). These values are consolidated in the sketch below. |
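
Since no source code is released (see the Open Source Code row), the setup details quoted above are the main handle for reproduction. The following minimal Python sketch collects them in one place; the `ExperimentConfig` class and its field names are hypothetical scaffolding, and the 100-run count for the first three domains is inferred from the phrase "averaged over 20 (instead of 100) runs". Only the numeric values come from the paper.

```python
# Hedged sketch, not the authors' code: the experiment parameters reported
# in the paper's setup section, gathered into one configuration table.

from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ExperimentConfig:
    domain: str
    horizon: int                            # episode length
    num_runs: int                           # independent runs averaged per curve
    num_outcomes: Optional[int] = None      # relative outcomes, where stated
    mcmc_iterations: Optional[int] = None   # MCMC iterations, where stated

# Concentration parameter for the CRP prior used by the MCMC baseline;
# the value of 0.5 recommended by Asmuth et al. [2009].
CRP_CONCENTRATION = 0.5

CONFIGS = [
    ExperimentConfig("Riverswim", horizon=20, num_runs=100,
                     num_outcomes=5, mcmc_iterations=500),
    ExperimentConfig("Marble Maze", horizon=30, num_runs=100,
                     num_outcomes=5, mcmc_iterations=100),
    ExperimentConfig("Six Arms", horizon=10, num_runs=100, num_outcomes=14),
    # Reward: +100 for reaching the goal, -1 per step taken; the outcome
    # count for this domain is not stated in the section above.
    ExperimentConfig("200-state gridworld", horizon=50, num_runs=20),
]
```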