Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Structured Reinforcement Learning for Combinatorial Decision-Making

Authors: Heiko Hoppe, Léo Baty, Louis Bouvier, Axel Parmentier, Maximilian Schiffer

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Across six environments with exogenous and endogenous uncertainty, SRL matches or surpasses the performance of unstructured RL and imitation learning on static tasks and improves over these baselines by up to 92% on dynamic problems, with improved stability and convergence speed.
Researcher Affiliation | Academia | Heiko Hoppe¹, Léo Baty², Louis Bouvier², Axel Parmentier², Maximilian Schiffer¹; ¹Technical University of Munich, ²École des Ponts
Pseudocode | Yes | Algorithm 1 Structured Reinforcement Learning
Open Source Code | Yes | Our code is available at https://github.com/tumBAIS/Structured-RL.
Open Datasets | Yes | We first consider three static environments, common industrial benchmarks from Dalle et al. [2022], namely a Warcraft Shortest Paths Problem, a Single Machine Scheduling Problem, and a Stochastic Vehicle Scheduling Problem. [...] The Dynamic environments model online decision-making in C-MDPs. We consider: i) a Dynamic Vehicle Scheduling Problem (DVSP), based on the Dynamic Vehicle Routing Problem introduced by Kool et al. [2022], Baty et al. [2024]. ii) A Dynamic Assortment Problem (DAP), adapted from Dulac-Arnold et al. [2016] and Chen et al. [2020]. iii) A Gridworld Shortest Paths Problem (GSPP), inspired by gridworld and robotic control tasks [Chandak et al., 2019, Zhang et al., 2020].
Dataset Splits | Yes | We separate all instances into a train, validation, and test dataset. [...] We train and test using |V| = 25 tasks. [...] An episode has 100 time steps, we use 100 train, validation, and test-episodes.
Hardware Specification | Yes | We conduct all experiments on a MacBook Air M3, using the Julia programming language.
Software Dependencies | No | We conduct all experiments on a MacBook Air M3, using the Julia programming language. [...] All other Julia packages we used are publicly available.
Experiment Setup | Yes | We present an overview over the hyperparameters of the algorithms in Table 2. For the RL algorithms, an episode consists of testing the algorithm's performance, collecting experience in the environment, and performing a number of updates, specified as iterations. [...] We tune hyperparameters per algorithm and environment, using the PPO-optimized episode numbers consistently across methods. Each algorithm is retrained using ten random seeds. Appendix C provides further details on the experimental setup and baselines.
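
The episode structure quoted in the Experiment Setup entry (test, collect experience, then perform a number of update iterations, repeated over ten random seeds) can be illustrated with a minimal Python sketch. This is not the authors' Julia code: the function names `run_episode`, `train`, and `retrain_over_seeds`, the scalar "policy quality", and all numeric constants are hypothetical stand-ins chosen only to show the control flow.

```python
import random
from statistics import mean


def run_episode(quality, rng, iterations):
    """One episode as described in the setup: test, collect, update.

    `quality` is a toy scalar standing in for a policy (assumption).
    """
    # 1) Test the current policy (toy: noisy read-out of its quality).
    test_score = quality + rng.gauss(0.0, 0.01)
    # 2) Collect experience in the environment (toy: random numbers in [0, 1)).
    experience = [rng.random() for _ in range(10)]
    # 3) Perform a fixed number of updates, the "iterations".
    for _ in range(iterations):
        quality += 0.01 * mean(experience)
    return test_score, quality


def train(seed, episodes=5, iterations=3):
    """Train once with a fixed seed and return the per-episode test scores."""
    rng = random.Random(seed)  # a private generator keeps each run reproducible
    quality = 0.0
    scores = []
    for _ in range(episodes):
        score, quality = run_episode(quality, rng, iterations)
        scores.append(score)
    return scores


def retrain_over_seeds(n_seeds=10, **kwargs):
    """Retrain from scratch once per seed; ten seeds mirrors the quoted setup."""
    return {seed: train(seed, **kwargs) for seed in range(n_seeds)}


results = retrain_over_seeds()
```

Seeding a separate `random.Random` per run means two runs with the same seed produce identical score curves, which is what makes per-seed comparisons across algorithms meaningful.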