SPO: Sequential Monte Carlo Policy Optimisation
Authors: Matthew Macfarlane, Edan Toledo, Donal Byrne, Paul Duckworth, Alexandre Laterre
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate statistically significant improvements in performance relative to model-free and model-based baselines across both continuous and discrete environments. From Section 5 (Experiments): In this section, we focus on three main areas of analysis. First, we demonstrate the improved performance of SPO in terms of episode returns, relative to both model-free and model-based algorithms. We conduct evaluations across a suite of common environments for both continuous control and discrete action spaces. Secondly, we examine the scaling behaviour of SPO during training, showing that asymptotic performance scales with particle count and depth. |
| Researcher Affiliation | Collaboration | Matthew V. Macfarlane (University of Amsterdam, m.v.macfarlane@uva.nl); Edan Toledo (InstaDeep); Donal Byrne (InstaDeep); Paul Duckworth (InstaDeep); Alexandre Laterre (InstaDeep) |
| Pseudocode | Yes | Algorithm 1 (SMC q-target estimation at timestep t) and Algorithm 2 (SPO Algorithm); see the SMC sketch after this table. |
| Open Source Code | Yes | Inference code and checkpoints are available at https://github.com/instadeepai/spo |
| Open Datasets | Yes | For continuous control we evaluate on the Brax [26] benchmark environments of: Ant, Half Cheetah, and Humanoid. For discrete environments, we evaluate on Boxoban [35] (a specific instance of Sokoban), commonly used to assess planning methods, and Rubik's Cube... The datasets employed in this study are publicly accessible at https://github.com/google-deepmind/boxoban-levels. |
| Dataset Splits | Yes | The datasets employed in this study are publicly accessible at https://github.com/google-deepmind/boxoban-levels. These datasets are split into different levels of difficulty, which are categorised in Table 1 (Summary of Boxoban dataset levels): Unfiltered-Train 900k, Unfiltered-Validation 100k, Unfiltered-Test 1k. See the level-parsing sketch after this table. |
| Hardware Specification | Yes | Training was performed using a mixture of Google v4-8 and v3-8 TPUs. Each experiment was run using a single TPU and only v3-8 TPUs were used to compare wall-clock time. |
| Software Dependencies | No | All implementations were conducted using JAX [14], with off-policy algorithms leveraging Flashbax [81]. (No version numbers provided for JAX or Flashbax in the text.) |
| Experiment Setup | Yes | D.3 Hyperparameters, Table 3 (SPO hyperparameters; Continuous / Discrete): Actor & Critic Learning Rate 3e-4 / 3e-4; Dual Learning Rate 1e-3 / 3e-4; Discount Factor 0.99 / 0.99; GAE Lambda 0.95 / 0.95; Replay Buffer Size 6.5e4 / 6.5e4; Batch Size 32 / 64; Batch Sequence Length 32 / 17; Max Grad Norm 0.5 / 0.5; Number of Epochs 128 / 16; Number of Envs 1024 / 768; Rollout Length 32 / 21; τ (Target Smoothing) 5e-3 / 5e-3; Number of Particles 16 / 16; Search Horizon 4 / 4; Resample Period 4 / 4; Initial η 10 / 0.5; Initial α n/a / 0.5; Initial α_µ 10 / n/a; Initial α_Σ 500 / n/a; ε_η 0.2 / 0.5; ε_α n/a / 1e-3; ε_α_µ 5e-2 / n/a; ε_α_Σ 5e-4 / n/a; Dirichlet Alpha n/a / 0.03; Root Exploration Weight n/a / 0.25. See the config sketch after this table. |
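
The pseudocode row above cites Algorithm 1 (SMC q-target estimation) and Algorithm 2 (SPO). As a rough illustration of the generic sequential Monte Carlo machinery these build on, and not the authors' exact algorithms, below is a minimal JAX sketch of particle rollouts with systematic resampling, using the Table 3 defaults of 16 particles, a search horizon of 4, and a resample period of 4. The `step_fn` and `score_fn` arguments are hypothetical stand-ins for the paper's environment step and weighting term.

```python
import jax
import jax.numpy as jnp

def systematic_resample(key, log_weights):
    """Systematic resampling: draw particle indices in proportion to the
    normalised importance weights (a standard SMC building block)."""
    n = log_weights.shape[0]
    weights = jax.nn.softmax(log_weights)
    positions = (jnp.arange(n) + jax.random.uniform(key)) / n
    return jnp.searchsorted(jnp.cumsum(weights), positions)

def smc_search(key, root_state, step_fn, score_fn,
               num_particles=16, horizon=4, resample_period=4):
    """Roll num_particles copies of the root state forward for `horizon`
    steps, accumulating log-weights from score_fn and resampling every
    resample_period steps. step_fn/score_fn are caller-supplied stand-ins,
    not the paper's actual model step or weighting."""
    particles = jnp.repeat(root_state[None, :], num_particles, axis=0)
    log_w = jnp.zeros(num_particles)
    for t in range(horizon):
        key, k_step, k_res = jax.random.split(key, 3)
        particles = step_fn(k_step, particles)
        log_w = log_w + score_fn(particles)
        if (t + 1) % resample_period == 0:
            idx = systematic_resample(k_res, log_w)
            particles, log_w = particles[idx], jnp.zeros(num_particles)
    return particles

# Toy usage: Gaussian random-walk dynamics, score = -||state||.
step = lambda k, p: p + 0.1 * jax.random.normal(k, p.shape)
score = lambda p: -jnp.linalg.norm(p, axis=-1)
final_particles = smc_search(jax.random.PRNGKey(0), jnp.zeros(3), step, score)
```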
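
For the Boxoban data, a small parsing sketch may help; it assumes the plain-text format of the boxoban-levels repository, where each level is introduced by a ';'-prefixed id line followed by an ASCII grid (`#` wall, `@` player, `$` box, `.` target). The function name and file path are illustrative.

```python
# Sketch: parse one Boxoban level file into a list of ASCII grids.
# Assumes the boxoban-levels text format; the path below is illustrative.
def load_boxoban_levels(path):
    levels, current = [], []
    with open(path) as f:
        for raw in f:
            line = raw.rstrip("\n")
            if line.startswith(";"):  # ';'-prefixed id line starts a new level
                if current:
                    levels.append("\n".join(current))
                current = []
            elif line:
                current.append(line)
    if current:                       # flush the final level
        levels.append("\n".join(current))
    return levels

levels = load_boxoban_levels("unfiltered/train/000.txt")
print(f"{len(levels)} levels loaded")
```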
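
To make the flattened Table 3 easier to scan, the continuous-control column is restated below as a plain Python dict. The key names are illustrative only and are not the configuration schema of the released codebase.

```python
# Continuous-control hyperparameters from Table 3 of the paper.
# Key names are hypothetical; values are taken from the table.
spo_continuous_config = {
    "actor_critic_lr": 3e-4,
    "dual_lr": 1e-3,
    "discount": 0.99,
    "gae_lambda": 0.95,
    "replay_buffer_size": 65_000,  # 6.5e4
    "batch_size": 32,
    "batch_sequence_length": 32,
    "max_grad_norm": 0.5,
    "num_epochs": 128,
    "num_envs": 1024,
    "rollout_length": 32,
    "target_smoothing_tau": 5e-3,
    "num_particles": 16,
    "search_horizon": 4,
    "resample_period": 4,
    "init_eta": 10.0,
    "init_alpha_mu": 10.0,
    "init_alpha_sigma": 500.0,
    "eps_eta": 0.2,
    "eps_alpha_mu": 5e-2,
    "eps_alpha_sigma": 5e-4,
}
```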