Bayes-Adaptive Simulation-based Search with Value Function Approximation
Authors: Arthur Guez, Nicolas Heess, David Silver, Peter Dayan
NeurIPS 2014

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we show that our approach requires considerably fewer simulations to find good policies than BAMCP in a (discrete) bandit task and two continuous control tasks with a Gaussian process prior over the dynamics [5, 6]. |
| Researcher Affiliation | Collaboration | Arthur Guez (1, 2), Nicolas Heess (2), David Silver (2), Peter Dayan (1); aguez@google.com; (1) Gatsby Unit, UCL; (2) Google DeepMind |
| Pseudocode | Yes | Algorithm 1: Bayes-Adaptive simulation-based search with root sampling |
| Open Source Code | No | The paper does not explicitly state that source code for the described methodology is being released or provide a link to it. |
| Open Datasets | No | The paper describes the setup for its experimental tasks (Bernoulli bandit, height map navigation, pendulum swing-up) but does not provide specific access information (link, DOI, repository, or formal citation with author/year for public availability) for any dataset used. |
| Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, or detailed splitting methodology) for training, validation, and testing. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details, such as library or solver names with version numbers, needed to replicate the experiments. |
| Experiment Setup | Yes | We consider the scenario γ = 0.99, p₀ = 0.2, for which the optimal decision and the posterior mean decision frequently differ. We compare BAMCP against BAFA on this domain, planning over 75 steps with a discount of 0.98. We use conventional parameter settings for the pendulum [5]: a mass of 1 kg, a length of 1 m, a maximum torque of 5 Nm, and a coefficient of friction of 0.05 kg m²/s. The state of the pendulum is s = (θ, θ̇). Each time-step corresponds to 0.05 s, γ = 0.98, and the reward function is R(s) = cos(θ). The histogram is computed with 100 runs with (a) K = 10000, or (b) K = 15000, simulations for each algorithm, horizon T = 50 and (for BAFA) M = 50 particles. |
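
The pendulum settings quoted in the Experiment Setup row can be collected into a small simulation sketch. The mass, length, torque limit, friction coefficient, time-step, discount, and reward come from that row. Everything else is an assumption made for illustration and is not taken from the paper: the point-mass equation of motion, the convention that θ = 0 is upright (suggested by the reward cos(θ) but not stated), the gravity constant, the explicit Euler integrator, and the names `step`, `reward`, and `discounted_return`.

```python
import math

# Values quoted in the Experiment Setup row.
MASS = 1.0        # kg
LENGTH = 1.0      # m
MAX_TORQUE = 5.0  # N m
FRICTION = 0.05   # kg m^2 / s
DT = 0.05         # seconds per time-step
GAMMA = 0.98      # discount factor

# Assumption: standard gravity; the source does not state a value.
GRAVITY = 9.81    # m / s^2


def reward(state):
    """R(s) = cos(theta), as quoted; the angular velocity is ignored."""
    theta, _theta_dot = state
    return math.cos(theta)


def step(state, torque):
    """Advance the state (theta, theta_dot) by one time-step.

    Assumed model (not from the source): a point mass on a massless rod,
    theta = 0 upright, integrated with one explicit Euler step.
    """
    theta, theta_dot = state
    # Clip the applied torque to the quoted maximum.
    u = max(-MAX_TORQUE, min(MAX_TORQUE, torque))
    inertia = MASS * LENGTH ** 2
    theta_ddot = (
        u - FRICTION * theta_dot + MASS * GRAVITY * LENGTH * math.sin(theta)
    ) / inertia
    return (theta + DT * theta_dot, theta_dot + DT * theta_ddot)


def discounted_return(rewards):
    """Sum of GAMMA**t * r_t over a reward sequence."""
    return sum(GAMMA ** t * r for t, r in enumerate(rewards))
```

For example, repeatedly calling `step` from the hanging position `(math.pi, 0.0)` with a fixed torque and passing the resulting rewards to `discounted_return` gives the discounted return of that open-loop policy under the assumed dynamics. The paper's own simulator may differ, for instance in the moment of inertia or the integrator.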