SPO: Sequential Monte Carlo Policy Optimisation

Authors: Matthew Macfarlane, Edan Toledo, Donal Byrne, Paul Duckworth, Alexandre Laterre

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate statistically significant improvements in performance relative to model-free and model-based baselines across both continuous and discrete environments. ... (Section 5, Experiments) In this section, we focus on three main areas of analysis. First, we demonstrate the improved performance of SPO in terms of episode returns, relative to both model-free and model-based algorithms. We conduct evaluations across a suite of common environments for both continuous control and discrete action spaces. Secondly, we examine the scaling behaviour of SPO during training, showing that asymptotic performance scales with particle count and depth.
Researcher Affiliation | Collaboration | Matthew V. Macfarlane (University of Amsterdam, m.v.macfarlane@uva.nl); Edan Toledo (InstaDeep); Donal Byrne (InstaDeep); Paul Duckworth (InstaDeep); Alexandre Laterre (InstaDeep)
Pseudocode | Yes | Algorithm 1: SMC q-target estimation (timestep t) and Algorithm 2: SPO Algorithm (a generic SMC resampling sketch follows this table).
Open Source Code | Yes | Inference code and checkpoints are available at https://github.com/instadeepai/spo
Open Datasets | Yes | For continuous control we evaluate on the Brax [26] benchmark environments of: Ant, Half Cheetah, and Humanoid. For discrete environments, we evaluate on Boxoban [35] (a specific instance of Sokoban), commonly used to assess planning methods, and Rubik's Cube... The datasets employed in this study are publicly accessible at https://github.com/google-deepmind/boxoban-levels. (A Brax environment-instantiation sketch follows this table.)
Dataset Splits | Yes | The datasets employed in this study are publicly accessible at https://github.com/google-deepmind/boxoban-levels. These datasets are split into different levels of difficulty, which are categorised in Table 1. ... Table 1 (Summary of Boxoban dataset levels): Unfiltered-Train 900k levels, Unfiltered-Validation 100k levels, Unfiltered-Test 1k levels. (A split-counting sketch follows this table.)
Hardware Specification | Yes | Training was performed using a mixture of Google v4-8 and v3-8 TPUs. Each experiment was run using a single TPU, and only v3-8 TPUs were used to compare wall-clock time.
Software Dependencies | No | All implementations were conducted using JAX [14], with off-policy algorithms leveraging Flashbax [81]. (No version numbers are provided for JAX or Flashbax in the text.)
Experiment Setup | Yes | D.3 Hyperparameters, Table 3: SPO hyperparameters for continuous and discrete environments (values listed as Continuous / Discrete; n/a where a parameter is listed under only one column; a configuration transcription follows this table):
Actor & Critic Learning Rate: 3e-4 / 3e-4
Dual Learning Rate: 1e-3 / 3e-4
Discount Factor: 0.99 / 0.99
GAE Lambda: 0.95 / 0.95
Replay Buffer Size: 6.5e4 / 6.5e4
Batch Size: 32 / 64
Batch Sequence Length: 32 / 17
Max Grad Norm: 0.5 / 0.5
Number of Epochs: 128 / 16
Number of Envs: 1024 / 768
Rollout Length: 32 / 21
τ (Target Smoothing): 5e-3 / 5e-3
Number of Particles: 16 / 16
Search Horizon: 4 / 4
Resample Period: 4 / 4
Initial η: 10 / 0.5
Initial α: n/a / 0.5
Initial αµ: 10 / n/a
Initial αΣ: 500 / n/a
ϵη: 0.2 / 0.5
ϵα: n/a / 1e-3
ϵαµ: 5e-2 / n/a
ϵαΣ: 5e-4 / n/a
Dirichlet Alpha: n/a / 0.03
Root Exploration Weight: n/a / 0.25
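The Pseudocode row refers to Algorithm 1 (SMC q-target estimation), which is not reproduced here. As a reading aid only, the sketch below shows generic multinomial resampling of weighted particles, a building block that sequential Monte Carlo procedures of this kind rely on. It is written in JAX to match the paper's tooling; the function and variable names are hypothetical and this is not the paper's Algorithm 1.

```python
# Generic multinomial resampling of weighted particles in JAX.
# NOT the paper's Algorithm 1; names and structure are illustrative only.
import jax
import jax.numpy as jnp


def resample_particles(rng, particles, log_weights):
    """Resample particles in proportion to their importance weights.

    particles:   [num_particles, ...] array of particle states.
    log_weights: [num_particles] unnormalised log importance weights.
    Returns the resampled particles and reset (uniform) log weights.
    """
    num_particles = log_weights.shape[0]
    probs = jax.nn.softmax(log_weights)  # normalise in log space for stability
    # Draw ancestor indices with probability proportional to the weights.
    ancestors = jax.random.choice(rng, num_particles, shape=(num_particles,), p=probs)
    resampled = jnp.take(particles, ancestors, axis=0)
    return resampled, jnp.zeros(num_particles)


# Toy usage with 16 particles (the particle count reported in Table 3).
key = jax.random.PRNGKey(0)
particles = jax.random.normal(key, (16, 3))   # hypothetical 3-dim particle states
log_weights = jax.random.normal(key, (16,))   # hypothetical log weights
resampled, reset_log_w = resample_particles(jax.random.PRNGKey(1), particles, log_weights)
```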
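For the continuous-control benchmarks named in the Open Datasets row, the following is a minimal sketch of instantiating the Ant, Half Cheetah, and Humanoid environments with Brax. The paper does not state the Brax version or API it uses, so the registered names and the create/reset/step calls below are assumptions about the public Brax interface.

```python
# Hedged sketch: instantiating the Brax benchmarks named in the paper.
# Environment names and the create/reset/step API are assumptions about the
# public Brax interface, not details taken from the paper.
import jax
import jax.numpy as jnp
from brax import envs

for name in ("ant", "halfcheetah", "humanoid"):
    env = envs.create(env_name=name)
    state = env.reset(rng=jax.random.PRNGKey(0))
    # Step once with a zero action just to confirm the environment runs.
    state = env.step(state, jnp.zeros(env.action_size))
    print(name, "observation size:", env.observation_size)
```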
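For the Boxoban splits listed in the Dataset Splits row, the sketch below counts levels per split after cloning the linked repository. The directory layout (unfiltered/train, unfiltered/valid, unfiltered/test) and the use of ";"-prefixed header lines to delimit levels are assumptions about the dataset's plain-text format, not something stated in the paper.

```python
# Hedged sketch: counting Boxoban levels per split after cloning
# https://github.com/google-deepmind/boxoban-levels locally.
# The directory layout and the ";"-prefixed level headers are assumptions.
from pathlib import Path

root = Path("boxoban-levels/unfiltered")  # hypothetical local clone path

for split in ("train", "valid", "test"):
    n_levels = 0
    for level_file in sorted((root / split).glob("*.txt")):
        # Each level inside a file is assumed to start with a "; <index>" header line.
        n_levels += sum(
            1 for line in level_file.read_text().splitlines() if line.startswith(";")
        )
    print(f"{split}: {n_levels} levels")
```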
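As a reading aid, the Table 3 values quoted in the Experiment Setup row can be transcribed into two plain Python dictionaries. The key names below are hypothetical; only the numeric values come from the quoted table, and the split of the single-column entries follows the column assignment given above.

```python
# Transcription of the quoted Table 3 values. Key names are hypothetical;
# only the numbers come from the table above.
spo_continuous = {
    "actor_critic_lr": 3e-4,
    "dual_lr": 1e-3,
    "discount": 0.99,
    "gae_lambda": 0.95,
    "replay_buffer_size": 65_000,
    "batch_size": 32,
    "batch_sequence_length": 32,
    "max_grad_norm": 0.5,
    "num_epochs": 128,
    "num_envs": 1024,
    "rollout_length": 32,
    "target_smoothing_tau": 5e-3,
    "num_particles": 16,
    "search_horizon": 4,
    "resample_period": 4,
    "init_eta": 10.0,
    "init_alpha_mu": 10.0,      # listed only under the continuous column
    "init_alpha_sigma": 500.0,  # listed only under the continuous column
    "eps_eta": 0.2,
    "eps_alpha_mu": 5e-2,
    "eps_alpha_sigma": 5e-4,
}

spo_discrete = {
    **spo_continuous,
    "dual_lr": 3e-4,
    "batch_size": 64,
    "batch_sequence_length": 17,
    "num_epochs": 16,
    "num_envs": 768,
    "rollout_length": 21,
    "init_eta": 0.5,
    "init_alpha": 0.5,
    "eps_eta": 0.5,
    "eps_alpha": 1e-3,
    "dirichlet_alpha": 0.03,
    "root_exploration_weight": 0.25,
}
# Drop the entries listed only under the continuous column.
for k in ("init_alpha_mu", "init_alpha_sigma", "eps_alpha_mu", "eps_alpha_sigma"):
    spo_discrete.pop(k)
```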