Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Sequential Monte Carlo for Policy Optimization in Continuous POMDPs

Authors: Hany Abdulsamad, Sahel Mohammad Iqbal, Simo Sarkka

NeurIPS 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We demonstrate the effectiveness of our algorithm across standard continuous POMDP benchmarks, where existing methods struggle to act under uncertainty. |
| Researcher Affiliation | Academia | 1 University of Amsterdam, 2 Aalto University |
| Pseudocode | Yes | Algorithm 1: Particle POMDP Policy Optimization (P3O). Algorithm 2: Particle filter to sample from Κ_T(z_{0:T}, a_{0:T-1}; ϕ). |
| Open Source Code | Yes | Complete experimental details are given in Appendix D and implementations of all algorithms are available at https://github.com/Sahel13/particle-pomdp. |
| Open Datasets | Yes | We validate our approach against a closed-form linear-quadratic Gaussian (LQG) benchmark. Control Tasks: Here we evaluate on two classical control tasks: stochastic, partially-observed variants of the pendulum and cart-pole swing-up tasks. Light-Dark: Next, we consider a continuous light-dark navigation task [Platt et al., 2010, Van Den Berg et al., 2012]. Triangulation: In our final experiment, we consider an active triangulation task, in which the agent must reach the origin in a two-dimensional plane relying solely on heading measurements [Tse and Bar-Shalom, 1975]. |
| Dataset Splits | No | The paper does not provide explicit training/test/validation dataset splits. It describes reinforcement learning tasks that involve interaction with environments, not static datasets. The closest information related to data usage is "We report the average return using 1024 trajectory rollouts and plot the mean and standard error over 10 training seeds." which refers to evaluation rather than dataset partitioning. |
| Hardware Specification | Yes | All our experiments were carried out on an NVIDIA A100 80GB GPU. |
| Software Dependencies | No | The paper mentions software components like "SAC implementation", "CleanRL [Huang et al., 2022]", and "Brax [Freeman et al., 2021]" but does not specify version numbers for these or other key software components (e.g., Python, PyTorch, CUDA). |
| Experiment Setup | Yes | For all algorithms considered (including P3O), we use a particle filter with 32 particles to track the belief state. Additionally, for P3O, the outer particle filter has 128 particles. For P3O, we use an additional slew rate penalty in the reward function that penalizes large changes in action in adjacent time steps. In Figure 4, we vary the number of particles used in the Feynman-Kac and belief filters. In Figure 5, we examine the impact of the temperature parameter η. Our SAC implementation and the hyperparameters chosen for training are based on the implementations in CleanRL [Huang et al., 2022] and Brax [Freeman et al., 2021]. |
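
Two details quoted in the table above can be illustrated with a short sketch: the slew rate penalty on changes between adjacent actions, and the reporting of a mean and standard error over training seeds. This is not taken from the paper's repository. The squared-difference form of the penalty, the `weight` coefficient, and both function names are assumptions made here for illustration; the paper's exact penalty may differ.

```python
import math
import statistics


def slew_rate_penalty(actions, weight=0.1):
    """Per-step penalty on changes between consecutive actions.

    actions: non-empty list of action vectors (each a list of floats).
    weight: hypothetical penalty coefficient, not a value from the paper.
    Returns one penalty per time step; the first step has no predecessor,
    so its penalty is zero.
    """
    penalties = [0.0]
    for prev, curr in zip(actions, actions[1:]):
        # Assumed form: squared Euclidean distance between adjacent actions.
        squared_change = sum((c - p) ** 2 for p, c in zip(prev, curr))
        penalties.append(weight * squared_change)
    return penalties


def mean_and_standard_error(returns_per_seed):
    """Aggregate rollout returns into a mean and standard error over seeds.

    returns_per_seed: list with one entry per training seed, each a list of
    rollout returns for that seed. Requires at least two seeds.
    """
    # Average return per seed (e.g. over that seed's evaluation rollouts).
    seed_means = [statistics.fmean(r) for r in returns_per_seed]
    mean = statistics.fmean(seed_means)
    # Standard error: sample standard deviation across seeds / sqrt(#seeds).
    sem = statistics.stdev(seed_means) / math.sqrt(len(seed_means))
    return mean, sem


if __name__ == "__main__":
    # Toy one-dimensional action sequence and toy returns for two seeds.
    print(slew_rate_penalty([[0.0], [1.0], [1.0], [-1.0]], weight=0.5))
    print(mean_and_standard_error([[1.0, 3.0], [3.0, 5.0]]))
```

In a reward function, such a penalty would typically be subtracted from the task reward at each step. With the setup quoted above, `returns_per_seed` would hold 10 entries of 1024 returns each.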