Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].
Sequential Monte Carlo for Policy Optimization in Continuous POMDPs
Authors: Hany Abdulsamad, Sahel Mohammad Iqbal, Simo Sarkka
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of our algorithm across standard continuous POMDP benchmarks, where existing methods struggle to act under uncertainty. |
| Researcher Affiliation | Academia | ¹University of Amsterdam, ²Aalto University |
| Pseudocode | Yes | Algorithm 1: Particle POMDP Policy Optimization (P3O). Algorithm 2: Particle filter to sample from Κ_T(z_{0:T}, a_{0:T-1}; φ). |
| Open Source Code | Yes | Complete experimental details are given in Appendix D and implementations of all algorithms are available at https://github.com/Sahel13/particle-pomdp. |
| Open Datasets | Yes | We validate our approach against a closed-form linear-quadratic Gaussian (LQG) benchmark. Control Tasks: Here we evaluate on two classical control tasks: stochastic, partially-observed variants of the pendulum and cart-pole swing-up tasks. Light-Dark: Next, we consider a continuous light-dark navigation task [Platt et al., 2010, Van Den Berg et al., 2012]. Triangulation: In our final experiment, we consider an active triangulation task, in which the agent must reach the origin in a two-dimensional plane relying solely on heading measurements [Tse and Bar-Shalom, 1975]. |
| Dataset Splits | No | The paper does not provide explicit training/test/validation dataset splits. It describes reinforcement learning tasks that involve interaction with environments, not static datasets. The closest information related to data usage is "We report the average return using 1024 trajectory rollouts and plot the mean and standard error over 10 training seeds." which refers to evaluation rather than dataset partitioning. |
| Hardware Specification | Yes | All our experiments were carried out on an NVIDIA A100 80GB GPU. |
| Software Dependencies | No | The paper mentions software components like "SAC implementation", "Clean RL [Huang et al., 2022]", and "Brax [Freeman et al., 2021]" but does not specify version numbers for these or other key software components (e.g., Python, PyTorch, CUDA). |
| Experiment Setup | Yes | For all algorithms considered (including P3O), we use a particle filter with 32 particles to track the belief state. Additionally, for P3O, the outer particle filter has 128 particles. For P3O, we use an additional slew rate penalty in the reward function that penalizes large changes in action in adjacent time steps. In Figure 4, we vary the number of particles used in the Feynman-Kac and belief filters. In Figure 5, we examine the impact of the temperature parameter η. Our SAC implementation and the hyperparameters chosen for training are based on the implementations in Clean RL [Huang et al., 2022] and Brax [Freeman et al., 2021]. |
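
The Experiment Setup row mentions two mechanisms that a short sketch may make more concrete: a 32-particle filter that tracks the belief state, and a slew rate penalty on changes in action between adjacent time steps. The Python below is only an illustration under stated assumptions and is not taken from the paper or its repository. The 1D toy model, the noise levels, the quadratic form and weight of the penalty, the simple controller, and all function names (`slew_rate_penalty`, `belief_filter_step`, `run_demo`) are hypothetical. Only the particle count of 32 and the idea of penalizing action changes come from the text above.

```python
import numpy as np


def slew_rate_penalty(prev_action, action, weight=0.1):
    """Hypothetical quadratic penalty on the change in action between adjacent steps."""
    diff = np.asarray(action, dtype=float) - np.asarray(prev_action, dtype=float)
    return -weight * float(np.sum(diff ** 2))


def belief_filter_step(rng, particles, action, observation,
                       trans_std=0.1, obs_std=0.2):
    """One bootstrap particle filter step on an assumed 1D toy model.

    Toy dynamics:    x' = x + action + N(0, trans_std^2)
    Toy observation: y  = x' + N(0, obs_std^2)
    """
    n = particles.shape[0]
    # Propagate each particle through the toy transition model.
    proposed = particles + action + trans_std * rng.standard_normal(n)
    # Weight by the Gaussian observation log-likelihood
    # (normalizing constants cancel once the weights are normalized).
    log_w = -0.5 * ((observation - proposed) / obs_std) ** 2
    log_w -= log_w.max()  # stabilize before exponentiating
    weights = np.exp(log_w)
    weights /= weights.sum()
    # Multinomial resampling back to equally weighted particles.
    idx = rng.choice(n, size=n, p=weights)
    return proposed[idx]


def run_demo(seed=0, num_particles=32, num_steps=20):
    """Roll out the toy system while tracking the belief with 32 particles."""
    rng = np.random.default_rng(seed)
    true_state = 1.0
    particles = rng.standard_normal(num_particles)  # broad initial belief
    prev_action = 0.0
    total_penalty = 0.0
    for _ in range(num_steps):
        # Assumed controller: push the belief mean toward zero.
        action = -0.5 * particles.mean()
        total_penalty += slew_rate_penalty(prev_action, action)
        # Simulate the true system and its noisy observation.
        true_state = true_state + action + 0.1 * rng.standard_normal()
        observation = true_state + 0.2 * rng.standard_normal()
        particles = belief_filter_step(rng, particles, action, observation)
        prev_action = action
    return particles, true_state, total_penalty
```

Calling `run_demo()` returns the final particle set, the final true state, and the accumulated penalty, which is never positive because each term is a negated squared difference. The paper's actual filters, reward shaping, and penalty form may differ from this sketch; see Appendix D of the paper and the linked repository for the authors' implementation.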