Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Breaking the Performance Ceiling in Reinforcement Learning requires Inference Strategies
Authors: Felix Chalumeau, Daniel Rajaonarivonivelomanantsoa, Ruan John de Kock, Juan Formanek, Sasha Abramowitz, Omayma Mahjoub, Wiem Khlifi, Simon Du Toit, Louay Nessir, Refiloe Shabe, Arnol Fokam, Siddarth Singh, Ulrich Armel Mbou Sob, Arnu Pretorius
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental data and code are available at https://sites.google.com/view/inference-strategies-rl. Our main result is striking: we can obtain up to a 126% and, on average, a 45% improvement over the previous state-of-the-art across 17 tasks, using only a couple seconds of extra wall-clock time during execution. We also demonstrate promising compute scaling properties, supported by over 60k experiments, making it the largest study on inference strategies for complex RL to date. |
| Researcher Affiliation | Collaboration | Felix Chalumeau¹; Daniel Rajaonarivonivelomanantsoa¹,²; Ruan de Kock¹; Claude Formanek¹; Sasha Abramowitz¹; Omayma Mahjoub¹; Wiem Khlifi¹; Simon Du Toit¹; Louay Ben Nessir¹; Refiloe Shabe¹; Noah De Nicola¹; Arnol Fokam¹; Siddarth Singh¹; Ulrich Mbou Sob¹; Arnu Pretorius¹,². ¹InstaDeep, ²Stellenbosch University |
| Pseudocode | Yes | In this section, we provide algorithmic descriptions for the inference strategies used in the paper, showing how the policy is used at inference time, under the time and budget constraints. Stochastic sampling is explained in Algorithm 1, SGBS in Algorithm 2, online fine-tuning in Algorithm 3 and COMPASS in Algorithm 4. |
| Open Source Code | Yes | Our experimental data and code are available at https://sites.google.com/view/inference-strategies-rl. |
| Open Datasets | Yes | Each environment (illustrated in Fig. 3) introduces distinct challenges that contribute to its complexity. RWARE requires agents to coordinate in order to pick up and deliver packages without collision and has a very sparse reward signal. In SMACv2 tasks, a team cooperates in real-time combat against enemies across diverse scenarios with randomised generation. Connector models the routing of a printed circuit board where agents must connect to designated targets without crossing paths. All three environments feature combinatorial and high-dimensional action spaces, partial observability and the need for tightly coordinated behaviours, making these 17 tasks a compelling test-bed for complex RL with desirable properties modelling aspects of real-world tasks. We use JAX-based implementations of Connector and RWARE from Jumanji (Bonnet et al., 2023) and for SMAC from JaxMARL (Rutherford et al., 2023). |
| Dataset Splits | No | We assume a distribution of problem instances $D$ that can be sampled from during training. The joint policy $\pi_\theta$, parameterised by $\theta$, is used to construct a solution sequentially by taking joint actions conditioned on the joint observation at each timestep. We use RL to train this policy to maximise the expected return obtained when building solutions to instances drawn from the distribution $D$ over a horizon $H$: $J(\pi_\theta) = \mathbb{E}_{D}\left[\sum_{t=0}^{H} \gamma^{t} R(s_t, a_t)\right]$. At inference time, a new problem instance $\rho$ is drawn from a distribution $D'$ (possibly different from $D$). This describes the sampling process, not explicit train/test/validation splits of a fixed dataset. |
| Hardware Specification | Yes | We use the same fixed hardware for all our experiments, namely an NVIDIA-A100-SXM4-80GB GPU. |
| Software Dependencies | No | All algorithmic implementations are built by extending the JAX-based research library, Mava (de Kock et al., 2021). Our version of CMA-ES used for searching COMPASS's latent space comes from the JAX-based quality diversity library, QDax (Chalumeau et al., 2024). Both these libraries are built to leverage the DeepMind JAX ecosystem (DeepMind et al., 2020) and use Flax (Heek et al., 2024) for building neural networks, optax for optimisers and orbax for model checkpointing. |
| Experiment Setup | Yes | Hyperparameters: To train the base policies, we re-use the hyperparameters reported in Mahjoub et al. (2025), which have been optimised for our tasks. For the inference strategies, we follow recommendations from the literature (Choo et al., 2022; Chalumeau et al., 2023b). All hyperparameter choices are reported in Appendix D. |
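
The discounted-return objective quoted in the Dataset Splits row, and the idea of spending an inference-time budget on repeated attempts, can be illustrated with a minimal Python sketch. This is not the paper's Algorithm 1 or any code from its release: `discounted_return`, `best_of_n`, and `toy_rollout` are hypothetical names, the rollout is a random stand-in for a policy acting in an environment, and the budget is counted in attempts rather than wall-clock time.

```python
import random
from typing import Callable, List, Sequence, Tuple


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Compute sum_t gamma^t * r_t for one trajectory, working backwards."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def best_of_n(
    sample_rollout: Callable[[random.Random], List[float]],
    budget: int,
    gamma: float,
    seed: int = 0,
) -> Tuple[float, List[float]]:
    """Sample `budget` rollouts and keep the one with the highest return.

    Hypothetical illustration of a sampling-based inference strategy;
    the budget here is a number of attempts, an assumption of this sketch.
    """
    rng = random.Random(seed)
    best_return = float("-inf")
    best_rewards: List[float] = []
    for _ in range(budget):
        rewards = sample_rollout(rng)
        ret = discounted_return(rewards, gamma)
        if ret > best_return:
            best_return, best_rewards = ret, rewards
    return best_return, best_rewards


def toy_rollout(rng: random.Random) -> List[float]:
    """Stand-in for a stochastic policy rollout: five random 0/1 rewards."""
    return [rng.choice([0.0, 1.0]) for _ in range(5)]


if __name__ == "__main__":
    single, _ = best_of_n(toy_rollout, budget=1, gamma=0.99)
    many, _ = best_of_n(toy_rollout, budget=50, gamma=0.99)
    print(f"best of 1: {single:.3f}, best of 50: {many:.3f}")
```

With a fixed seed, the first rollout is identical for both budgets, so the larger budget can never return a worse result; it only gains extra attempts to improve on it.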