Efficient Bayesian Clustering for Reinforcement Learning

Authors: Travis Mandel, Yun-En Liu, Emma Brunskill, Zoran Popovic

IJCAI 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | TCRL-Theoretic achieves near-optimal Bayesian regret bounds while consistently improving over a standard Bayesian exploration approach. TCRL-Relaxed is guaranteed to converge to acting optimally, and empirically outperforms state-of-the-art Bayesian clustering algorithms across a variety of simulated domains, even in cases where no states are similar.
Researcher Affiliation | Collaboration | Travis Mandel (1), Yun-En Liu (2), Emma Brunskill (3), and Zoran Popović (1,2). Affiliations: (1) Center for Game Science, Computer Science & Engineering, University of Washington, Seattle, WA; (2) Enlearn, Seattle, WA; (3) School of Computer Science, Carnegie Mellon University, Pittsburgh, PA
Pseudocode | Yes | Algorithm 1: TCRL-Theoretic; Algorithm 2: TCRL-Relaxed; Algorithm 3: Subroutine for TCRL-Relaxed; Algorithm 4: Subroutine for TCRL-Theoretic
Open Source Code | No | The paper does not provide any concrete statements or links regarding the availability of its source code.
Open Datasets | Yes | Riverswim [Strehl and Littman, 2008]; Marble Maze [Asmuth et al., 2009; Russell et al., 1994]; Six Arms [Strehl and Littman, 2008]; 200-state gridworld featuring one-dimensional walls [Johns and Mahadevan, 2007]
Dataset Splits | No | The paper evaluates agents that interact with simulated reinforcement learning environments over episodes rather than training on static datasets, so fixed train/validation/test splits are not applicable and none are described.
Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments (e.g., CPU or GPU models, memory specifications).
Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies or libraries used in its implementation or experiments.
Experiment Setup | Yes | "We used a horizon of 20 and defined 5 relative outcomes for moving left and right or staying with some reward." (Riverswim); "We used a horizon of 30 and a set of 5 outcomes denoting whether the agent moved in each cardinal direction or hit a wall." (Marble Maze); "We used a horizon of 10 and 14 relative observations." (Six Arms); "We chose the reward to be 100 for reaching the goal and -1 for each step taken, and since the problem was harder we used a longer horizon of 50 and averaged over 20 (instead of 100) runs." (200-state gridworld); "We choose the value of 0.5 recommended by Asmuth et al. [2009]." (CRP concentration parameter for MCMC); "We used 500 iterations in Riverswim (recommended for Chain, a similar environment), and 100 as recommended for Marble Maze." (MCMC iterations)
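
For readers cross-checking the reported setup, the quoted parameters can be collected into a small per-domain configuration. The sketch below does only that, plus a generic episodic evaluation skeleton; it is illustrative and not the authors' code (the paper releases none). The numeric values are transcribed from the quotes above, while `DomainConfig`, `make_env`, the agent interface (`act`/`observe`), and the `env.step` return convention are hypothetical assumptions. The outcome count for the 200-state gridworld is not stated in the quoted setup, so it is left unset.

```python
from dataclasses import dataclass
from typing import Optional

# Per-domain settings transcribed from the paper's experiment setup quotes.
# Field names and the evaluation loop are illustrative assumptions.
@dataclass
class DomainConfig:
    horizon: int                     # episode length H
    n_runs: int                      # independent runs averaged in reported curves
    n_outcomes: Optional[int] = None # number of relative outcomes/observations

DOMAINS = {
    "Riverswim":     DomainConfig(horizon=20, n_runs=100, n_outcomes=5),
    "Marble Maze":   DomainConfig(horizon=30, n_runs=100, n_outcomes=5),
    "Six Arms":      DomainConfig(horizon=10, n_runs=100, n_outcomes=14),
    # Goal reward 100, step reward -1; outcome count not stated in the quotes.
    "200-state gridworld": DomainConfig(horizon=50, n_runs=20),
}

# Settings quoted for the MCMC clustering baseline.
CRP_CONCENTRATION = 0.5                                   # Asmuth et al. [2009]
MCMC_ITERATIONS = {"Riverswim": 500, "Marble Maze": 100}  # per-domain iterations


def evaluate(agent, make_env, cfg: DomainConfig, n_episodes: int) -> float:
    """Generic episodic evaluation skeleton (hypothetical interfaces).

    Averages cumulative reward over cfg.n_runs independent runs, each
    consisting of n_episodes episodes truncated at cfg.horizon steps.
    """
    run_returns = []
    for _ in range(cfg.n_runs):
        env = make_env()
        total = 0.0
        for _ in range(n_episodes):
            state = env.reset()
            for _ in range(cfg.horizon):
                action = agent.act(state)
                state, reward, done = env.step(action)
                agent.observe(state, reward)
                total += reward
                if done:
                    break
        run_returns.append(total)
    return sum(run_returns) / len(run_returns)
```

The cumulative-reward averaging over runs mirrors the paper's evaluation style; the agent and environment interfaces above are placeholders for whichever TCRL variant or clustering baseline is being reproduced.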