SPO: Sequential Monte Carlo Policy Optimisation

Authors: Matthew Macfarlane, Edan Toledo, Donal Byrne, Paul Duckworth, Alexandre Laterre

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate statistically significant improvements in performance relative to model-free and model-based baselines across both continuous and discrete environments. ... (Section 5, Experiments) In this section, we focus on three main areas of analysis. First, we demonstrate the improved performance of SPO in terms of episode returns, relative to both model-free and model-based algorithms. We conduct evaluations across a suite of common environments for both continuous control and discrete action spaces. Secondly, we examine the scaling behaviour of SPO during training, showing that asymptotic performance scales with particle count and depth.
Researcher Affiliation | Collaboration | Matthew V. Macfarlane (University of Amsterdam, m.v.macfarlane@uva.nl); Edan Toledo (InstaDeep); Donal Byrne (InstaDeep); Paul Duckworth (InstaDeep); Alexandre Laterre (InstaDeep)
Pseudocode | Yes | Algorithm 1: SMC q-target estimation (timestep t) and Algorithm 2: SPO Algorithm (a generic SMC resampling sketch follows this table).
Open Source Code | Yes | Inference code and checkpoints are available at https://github.com/instadeepai/spo
Open Datasets | Yes | For continuous control we evaluate on the Brax [26] benchmark environments of: Ant, Half Cheetah, and Humanoid. For discrete environments, we evaluate on Boxoban [35] (a specific instance of Sokoban), commonly used to assess planning methods, and Rubik's Cube... The datasets employed in this study are publicly accessible at https://github.com/google-deepmind/boxoban-levels. (A Brax environment-instantiation sketch follows this table.)
Dataset Splits | Yes | The datasets employed in this study are publicly accessible at https://github.com/google-deepmind/boxoban-levels. These datasets are split into different levels of difficulty, which are categorised in Table 1. ... Table 1 (Summary of Boxoban dataset levels): Unfiltered-Train 900k levels, Unfiltered-Validation 100k levels, Unfiltered-Test 1k levels. (A split-counting sketch follows this table.)
Hardware Specification | Yes | Training was performed using a mixture of Google v4-8 and v3-8 TPUs. Each experiment was run using a single TPU, and only v3-8 TPUs were used to compare wall-clock time.
Software Dependencies | No | All implementations were conducted using JAX [14], with off-policy algorithms leveraging Flashbax [81]. (No version numbers are provided for JAX or Flashbax in the text.)
Experiment Setup | Yes | D.3 Hyperparameters, Table 3: SPO hyperparameters for continuous and discrete environments (values listed as Continuous / Discrete; n/a where a parameter is listed under only one column; a configuration transcription follows this table):
Actor & Critic Learning Rate: 3e-4 / 3e-4
Dual Learning Rate: 1e-3 / 3e-4
Discount Factor: 0.99 / 0.99
GAE Lambda: 0.95 / 0.95
Replay Buffer Size: 6.5e4 / 6.5e4
Batch Size: 32 / 64
Batch Sequence Length: 32 / 17
Max Grad Norm: 0.5 / 0.5
Number of Epochs: 128 / 16
Number of Envs: 1024 / 768
Rollout Length: 32 / 21
τ (Target Smoothing): 5e-3 / 5e-3
Number of Particles: 16 / 16
Search Horizon: 4 / 4
Resample Period: 4 / 4
Initial η: 10 / 0.5
Initial α: n/a / 0.5
Initial αµ: 10 / n/a
Initial αΣ: 500 / n/a
ϵη: 0.2 / 0.5
ϵα: n/a / 1e-3
ϵαµ: 5e-2 / n/a
ϵαΣ: 5e-4 / n/a
Dirichlet Alpha: n/a / 0.03
Root Exploration Weight: n/a / 0.25
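The Pseudocode row refers to Algorithm 1 (SMC q-target estimation), which is not reproduced here. As a reading aid only, the sketch below shows generic multinomial resampling of weighted particles, a building block that sequential Monte Carlo procedures of this kind rely on. It is written in JAX to match the paper's tooling; the function and variable names are hypothetical and this is not the paper's Algorithm 1.

```python
# Generic multinomial resampling of weighted particles in JAX.
# NOT the paper's Algorithm 1; names and structure are illustrative only.
import jax
import jax.numpy as jnp


def resample_particles(rng, particles, log_weights):
    """Resample particles in proportion to their importance weights.

    particles:   [num_particles, ...] array of particle states.
    log_weights: [num_particles] unnormalised log importance weights.
    Returns the resampled particles and reset (uniform) log weights.
    """
    num_particles = log_weights.shape[0]
    probs = jax.nn.softmax(log_weights)  # normalise in log space for stability
    # Draw ancestor indices with probability proportional to the weights.
    ancestors = jax.random.choice(rng, num_particles, shape=(num_particles,), p=probs)
    resampled = jnp.take(particles, ancestors, axis=0)
    return resampled, jnp.zeros(num_particles)


# Toy usage with 16 particles (the particle count reported in Table 3).
key = jax.random.PRNGKey(0)
particles = jax.random.normal(key, (16, 3))   # hypothetical 3-dim particle states
log_weights = jax.random.normal(key, (16,))   # hypothetical log weights
resampled, reset_log_w = resample_particles(jax.random.PRNGKey(1), particles, log_weights)
```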
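For the continuous-control benchmarks named in the Open Datasets row, the following is a minimal sketch of instantiating the Ant, Half Cheetah, and Humanoid environments with Brax. The paper does not state the Brax version or API it uses, so the registered names and the create/reset/step calls below are assumptions about the public Brax interface.

```python
# Hedged sketch: instantiating the Brax benchmarks named in the paper.
# Environment names and the create/reset/step API are assumptions about the
# public Brax interface, not details taken from the paper.
import jax
import jax.numpy as jnp
from brax import envs

for name in ("ant", "halfcheetah", "humanoid"):
    env = envs.create(env_name=name)
    state = env.reset(rng=jax.random.PRNGKey(0))
    # Step once with a zero action just to confirm the environment runs.
    state = env.step(state, jnp.zeros(env.action_size))
    print(name, "observation size:", env.observation_size)
```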
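For the Boxoban splits listed in the Dataset Splits row, the sketch below counts levels per split after cloning the linked repository. The directory layout (unfiltered/train, unfiltered/valid, unfiltered/test) and the use of ";"-prefixed header lines to delimit levels are assumptions about the dataset's plain-text format, not something stated in the paper.

```python
# Hedged sketch: counting Boxoban levels per split after cloning
# https://github.com/google-deepmind/boxoban-levels locally.
# The directory layout and the ";"-prefixed level headers are assumptions.
from pathlib import Path

root = Path("boxoban-levels/unfiltered")  # hypothetical local clone path

for split in ("train", "valid", "test"):
    n_levels = 0
    for level_file in sorted((root / split).glob("*.txt")):
        # Each level inside a file is assumed to start with a "; <index>" header line.
        n_levels += sum(
            1 for line in level_file.read_text().splitlines() if line.startswith(";")
        )
    print(f"{split}: {n_levels} levels")
```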
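As a reading aid, the Table 3 values quoted in the Experiment Setup row can be transcribed into two plain Python dictionaries. The key names below are hypothetical; only the numeric values come from the quoted table, and the split of the single-column entries follows the column assignment given above.

```python
# Transcription of the quoted Table 3 values. Key names are hypothetical;
# only the numbers come from the table above.
spo_continuous = {
    "actor_critic_lr": 3e-4,
    "dual_lr": 1e-3,
    "discount": 0.99,
    "gae_lambda": 0.95,
    "replay_buffer_size": 65_000,
    "batch_size": 32,
    "batch_sequence_length": 32,
    "max_grad_norm": 0.5,
    "num_epochs": 128,
    "num_envs": 1024,
    "rollout_length": 32,
    "target_smoothing_tau": 5e-3,
    "num_particles": 16,
    "search_horizon": 4,
    "resample_period": 4,
    "init_eta": 10.0,
    "init_alpha_mu": 10.0,      # listed only under the continuous column
    "init_alpha_sigma": 500.0,  # listed only under the continuous column
    "eps_eta": 0.2,
    "eps_alpha_mu": 5e-2,
    "eps_alpha_sigma": 5e-4,
}

spo_discrete = {
    **spo_continuous,
    "dual_lr": 3e-4,
    "batch_size": 64,
    "batch_sequence_length": 17,
    "num_epochs": 16,
    "num_envs": 768,
    "rollout_length": 21,
    "init_eta": 0.5,
    "init_alpha": 0.5,
    "eps_eta": 0.5,
    "eps_alpha": 1e-3,
    "dirichlet_alpha": 0.03,
    "root_exploration_weight": 0.25,
}
# Drop the entries listed only under the continuous column.
for k in ("init_alpha_mu", "init_alpha_sigma", "eps_alpha_mu", "eps_alpha_sigma"):
    spo_discrete.pop(k)
```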