Adaptive Thompson Sampling Stacks for Memory Bounded Open-Loop Planning

Authors: Thomy Phan, Thomas Gabor, Robert Müller, Christoph Roch, Claudia Linnhoff-Popien

IJCAI 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically test SYMBOL in four large POMDP benchmark problems to demonstrate its effectiveness and robustness w.r.t. the choice of hyperparameters and evaluate its adaptive memory consumption. We also compare its performance with other open-loop planning algorithms and POMCP.
Researcher Affiliation | Academia | Thomy Phan, Thomas Gabor, Robert Müller, Christoph Roch and Claudia Linnhoff-Popien, LMU Munich, {thomy.phan, thomas.gabor, robert.mueller, christoph.roch, linnhoff}@ifi.lmu.de
Pseudocode | Yes | Algorithm 1 Generalized Thompson Sampling; Algorithm 2 SYMBOL Planning
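To illustrate the kind of procedure the paper's Algorithm 1 generalizes, here is a minimal, self-contained sketch of Thompson Sampling for a Bernoulli bandit with conjugate Beta posteriors. This is an illustrative example only, not the paper's pseudocode: SYMBOL applies Thompson Sampling with Normal-Gamma posteriors inside an open-loop planning stack, and the function names here are ours.

```python
import random

def thompson_select(posteriors):
    """Sample a mean estimate from each arm's Beta posterior, pick the argmax.

    posteriors: dict arm -> (alpha, beta) parameters of a Beta distribution.
    """
    draws = {arm: random.betavariate(a, b) for arm, (a, b) in posteriors.items()}
    return max(draws, key=draws.get)

def update(posteriors, arm, reward):
    """Conjugate Beta update for a Bernoulli reward in {0, 1}."""
    a, b = posteriors[arm]
    posteriors[arm] = (a + reward, b + (1 - reward))

# Usage: two arms with (hypothetical) true success probabilities 0.3 and 0.7.
true_p = {"a": 0.3, "b": 0.7}
post = {"a": (1.0, 1.0), "b": (1.0, 1.0)}  # uniform (uninformative) priors
random.seed(0)
for _ in range(2000):
    arm = thompson_select(post)
    reward = 1 if random.random() < true_p[arm] else 0
    update(post, arm, reward)
```

Because each arm is chosen with probability proportional to it being optimal under the posterior, pulls concentrate on the better arm as evidence accumulates; this exploration-by-posterior-sampling is the core idea the paper's generalized version builds on.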
Open Source Code | Yes | Code is available at https://github.com/thomyphan/planning
Open Datasets | Yes | We tested SYMBOL in different POMDP benchmark problems [Silver and Veness, 2010; Somani et al., 2013]. The RockSample(n, k) problem simulates an agent moving in an n × n grid, which contains k rocks [Smith and Simmons, 2004]. In Battleship five ships of size 1, 2, 3, 4, and 5 respectively are randomly placed into a 10 × 10 grid, where the agent has to sink all ships without knowing their actual positions [Silver and Veness, 2010]. PocMan is a partially observable version of Pac-Man [Silver and Veness, 2010].
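The Battleship setup described above can be sketched as a small rejection-sampling placement routine. This is a minimal illustration under our own assumptions (e.g. ships are axis-aligned and merely non-overlapping); the exact placement rules of the benchmark follow Silver and Veness [2010], not this sketch.

```python
import random

def place_ships(sizes=(1, 2, 3, 4, 5), n=10, rng=random):
    """Randomly place non-overlapping horizontal/vertical ships on an n x n grid.

    Returns a list of ships, each a set of occupied (row, col) cells.
    """
    occupied = set()
    ships = []
    for size in sorted(sizes, reverse=True):  # place large ships first
        while True:
            horiz = rng.random() < 0.5
            r = rng.randrange(n if horiz else n - size + 1)
            c = rng.randrange(n - size + 1 if horiz else n)
            cells = {(r, c + i) if horiz else (r + i, c) for i in range(size)}
            if not cells & occupied:  # reject overlapping candidates
                occupied |= cells
                ships.append(cells)
                break
    return ships

# Usage: one random board with ships of size 1-5, i.e. 15 occupied cells.
random.seed(3)
ships = place_ships()
print(sum(len(s) for s in ships))  # prints 15
```

Since the agent never observes these placements, it must plan over a belief about all consistent boards, which is what makes the domain a large POMDP.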
Dataset Splits | No | The paper evaluates on standard benchmark problems but does not specify explicit training/validation/test splits (e.g., percentages, sample counts, or citations to predefined splits); each entire problem setup is used for evaluation.
Hardware Specification | No | The paper mentions computation time and memory constraints but does not provide specific details about the hardware used for experiments, such as GPU/CPU models, memory amounts, or cloud instance types.
Software Dependencies | No | The paper describes the algorithms implemented (e.g., SYMBOL, POMCP, UCB1, Thompson Sampling) but does not provide specific version numbers for software dependencies or libraries (e.g., Python 3.x, PyTorch 1.x).
Experiment Setup | Yes | We always set γ as proposed in [Silver and Veness, 2010]. For POMCP and POOLUCT we set the UCB1 exploration constant c to the reward range of each domain as proposed in [Silver and Veness, 2010]. Since we assume no additional domain knowledge, we focus on uninformative priors with µ0 = 0, α0 = 1, and λ0 = 0.01 [Bai et al., 2014]. With this setting, β0 controls the degree of initial exploration during the planning phase, thus we only vary β0 for POOLTS, POSTS, and SYMBOL. We also experimented with the convergence tolerance κ ∈ {2, 4, 8, 16, 32}. Fig. 2a-d show the performance of SYMBOL with ϵ ∈ {3.2, 6.4, 12.8} compared to all other approaches described in Section 5.2. The results are shown in Fig. 3 for nb = 4096, T = 100, β0 ∈ {100, 500, 1000} for POOLTS, POSTS, and SYMBOL, ϵ = 6.4, and κ = 8.
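The quoted setup uses a Normal-Gamma prior with hyperparameters (µ0, λ0, α0, β0) following Bai et al. [2014], where β0 governs initial exploration. A minimal sketch of why that is, assuming the standard Normal-Gamma parameterization (precision τ ~ Gamma(shape α0, rate β0), mean µ | τ ~ Normal(µ0, 1/(λ0 τ))); the function name is ours:

```python
import random

def sample_normal_gamma_mean(mu0, lam0, alpha0, beta0):
    """Draw one mean estimate from a Normal-Gamma distribution.

    Larger beta0 -> smaller typical precision tau -> wider draws of the mean,
    i.e. more initial exploration when arms are ranked by sampled means.
    Note: random.gammavariate takes a SCALE parameter, hence 1/beta0 for rate beta0.
    """
    tau = random.gammavariate(alpha0, 1.0 / beta0)  # sampled precision
    return random.gauss(mu0, (1.0 / (lam0 * tau)) ** 0.5)

# Usage with the uninformative prior quoted above (mu0=0, lam0=0.01, alpha0=1):
random.seed(1)
draw = sample_normal_gamma_mean(0.0, 0.01, 1.0, 500.0)
```

This matches the paper's observation that only β0 needs varying: with µ0 = 0, α0 = 1, and λ0 = 0.01 fixed, β0 alone sets how spread out the initial mean samples are.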