Bayes-Adaptive Simulation-based Search with Value Function Approximation

Authors: Arthur Guez, Nicolas Heess, David Silver, Peter Dayan

NeurIPS 2014

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we show that our approach requires considerably fewer simulations to find good policies than BAMCP in a (discrete) bandit task and two continuous control tasks with a Gaussian process prior over the dynamics [5, 6].
Researcher Affiliation | Collaboration | Arthur Guez (1,2), Nicolas Heess (2), David Silver (2), Peter Dayan (1); aguez@google.com; 1: Gatsby Unit, UCL; 2: Google DeepMind
Pseudocode | Yes | Algorithm 1: Bayes-Adaptive simulation-based search with root sampling (a generic root-sampling sketch appears after the table)
Open Source Code | No | The paper does not explicitly state that source code for the described methodology is released, nor does it provide a link to any.
Open Datasets | No | The paper describes the setup of its experimental tasks (Bernoulli bandit, height-map navigation, pendulum swing-up) but does not provide access information (a link, DOI, repository, or formal citation with author/year indicating public availability) for any dataset used.
Dataset Splits | No | The paper does not provide dataset split information (exact percentages, sample counts, or a detailed splitting methodology) for training, validation, and testing.
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used to run its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details, such as library or solver names with version numbers, needed to replicate the experiments.
Experiment Setup | Yes | We consider the scenario γ = 0.99, p0 = 0.2, for which the optimal decision and the posterior-mean decision frequently differ. We compare BAMCP against BAFA on this domain, planning over 75 steps with a discount of 0.98. We use conventional parameter settings for the pendulum [5]: a mass of 1 kg, a length of 1 m, a maximum torque of 5 Nm, and a coefficient of friction of 0.05 kg m²/s. The state of the pendulum is s = (θ, θ̇). Each time-step corresponds to 0.05 s, γ = 0.98, and the reward function is R(s) = cos(θ). The histogram is computed from 100 runs with (a) K = 10000 or (b) K = 15000 simulations for each algorithm, horizon T = 50, and (for BAFA) M = 50 particles. (A pendulum-simulator sketch appears after the table.)
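
The Pseudocode row refers to Algorithm 1 (Bayes-Adaptive simulation-based search with root sampling), of which only the name is quoted above. Below is a minimal, hypothetical sketch of the generic root-sampling idea: one dynamics model is drawn from the posterior at the root of each simulation, the trajectory is simulated under that fixed model, and a learned value function is updated from the simulated data. The `posterior`, `dynamics`, and `value_fn` interfaces are illustrative assumptions, not the paper's actual Algorithm 1.

```python
# Hypothetical sketch of root sampling with value function approximation.
# All interfaces (posterior.sample, dynamics.step, value_fn.*) are assumed
# placeholders, not the paper's actual Algorithm 1.

def bayes_adaptive_search(start_state, posterior, value_fn,
                          num_simulations, horizon, gamma=0.98):
    """Plan from start_state via simulations, one sampled model per simulation."""
    for _ in range(num_simulations):
        dynamics = posterior.sample()      # root sampling: draw a model once, at the root
        state, trajectory = start_state, []
        for _ in range(horizon):
            action = value_fn.select_action(state)          # e.g. epsilon-greedy
            next_state, reward = dynamics.step(state, action)
            trajectory.append((state, action, reward, next_state))
            state = next_state
        value_fn.update(trajectory, gamma)  # e.g. a TD-style update on simulated data
    return value_fn.greedy_action(start_state)
```

The key property of root sampling is that the sampled model stays fixed for the whole simulated trajectory, so the belief state never needs to be updated inside a simulation.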
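
For the pendulum swing-up numbers in the Experiment Setup row, the following is a minimal sketch of a standard damped-pendulum simulator using the quoted constants (mass 1 kg, length 1 m, maximum torque 5 Nm, friction 0.05 kg m²/s, time-step 0.05 s, reward cos θ). The dynamics equation, the gravity constant, and the convention that θ = 0 is upright are assumptions; only the constants and the reward are quoted from the paper.

```python
import math

# Constants quoted in the Experiment Setup row; the dynamics form below is assumed.
MASS = 1.0        # kg
LENGTH = 1.0      # m
MAX_TORQUE = 5.0  # Nm
FRICTION = 0.05   # kg m^2 / s
DT = 0.05         # s per time-step
GAMMA = 0.98      # discount factor
GRAVITY = 9.81    # m / s^2 (assumed; not stated in the quoted text)


def reward(theta):
    """R(s) = cos(theta): maximal when the pendulum is upright (theta = 0 assumed upright)."""
    return math.cos(theta)


def step(theta, theta_dot, torque):
    """One Euler step of a damped pendulum with theta measured from the upright position."""
    torque = max(-MAX_TORQUE, min(MAX_TORQUE, torque))
    inertia = MASS * LENGTH ** 2
    # Gravity destabilises the upright position, friction damps the angular velocity.
    theta_ddot = (torque - FRICTION * theta_dot
                  + MASS * GRAVITY * LENGTH * math.sin(theta)) / inertia
    theta_dot = theta_dot + DT * theta_ddot
    theta = theta + DT * theta_dot
    return theta, theta_dot
```

Under these settings a discounted return over a horizon of T steps would weight the reward at step t by GAMMA ** t, matching the γ = 0.98 quoted above.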