Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

BraVE: Offline Reinforcement Learning for Discrete Combinatorial Action Spaces

Authors: Matthew Landers, Taylor W Killian, Hugo Barnes, Tom Hartvigsen, Afsaneh Doryab

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that BraVE consistently outperforms state-of-the-art baselines across a suite of challenging offline RL tasks with combinatorial action spaces containing up to 4 million discrete actions. In high-dimensional environments with strong sub-action dependencies, BraVE improves average return by up to 20 over state-of-the-art offline RL methods. While baseline performance degrades with increasing sub-action dependencies or action space size, BraVE maintains stable performance. Section 4: Experimental Evaluation
Researcher Affiliation | Academia | ¹University of Virginia, ²MBZUAI
Pseudocode | Yes | Algorithm 1: Compute BraVE Loss
Open Source Code | Yes | ²Code is available at https://github.com/matthewlanders/BraVE ... To support reproducibility, we have released our code at: https://anonymous.4open.science/r/BraVE-28CD.
Open Datasets | No | We evaluate BraVE in the Combinatorial Navigation Environment (CoNE), a high-dimensional discrete control domain designed to stress-test policy learning under large action spaces and sub-action dependencies. ... Dataset Construction: We generate offline datasets using a stochastic variant of A*. ... Justification: An anonymized link to our implementation is provided in a footnote on page 1. This supplemental material includes instructions for running the code and reproducing results. It also contains an implementation of CoNE, including the dataset generation process.
Dataset Splits | No | Dataset Construction: We generate offline datasets using a stochastic variant of A*. At each step, the optimal action is selected with probability 0.1, and a random valid action is chosen otherwise. This procedure yields a diverse mixture of trajectories with varying returns, including both near-optimal and suboptimal behavior. The resulting datasets reflect realistic offline settings in which learning must proceed from heterogeneous, partially optimal demonstrations. All methods are trained for 20,000 gradient steps and evaluated every 100 steps.
Hardware Specification | Yes | All experiments were conducted on a single NVIDIA A40 GPU using Python 3.9 and PyTorch 2.6. ... These results are based on tests conducted on a 10-core CPU with 32 GB of unified memory and no GPU.
Software Dependencies | Yes | All experiments were conducted on a single NVIDIA A40 GPU using Python 3.9 and PyTorch 2.6.
Experiment Setup | Yes | All methods are trained for 20,000 gradient steps and evaluated every 100 steps. BraVE is trained using a behavior-regularized temporal difference (TD) loss that penalizes value estimates for actions unlikely under the dataset. Given a transition $(s, a, r, s', a')$ sampled from the replay buffer $B$, the TD target is computed using the action $\hat{a}' = \arg\max_{a'} Q(s', a'; \theta^-)$ selected via the tree traversal procedure described in Section 3.3. The loss is defined as: $L_{TD}(\theta) = \mathbb{E}_{(s,a,r,s',a') \sim B} \left[ \left( \lambda \left( r + \gamma Q(s', \hat{a}'; \theta^-) \right) - \|\hat{a}' - a'\| - Q(s, a; \theta) \right)^2 \right]$ (4), where $\lambda$ is a regularization coefficient and $\|\hat{a}' - a'\|$ penalizes deviation from the behavior action, following principles introduced in TD3+BC (Equation 2). We combine this behavior-regularized TD loss with a branch value supervision loss $L_{BraVE}$, resulting in a total objective $L = \alpha L_{TD} + L_{BraVE}$, where $\alpha$ controls the relative weighting of the TD term. During training, we apply a depth-based weighting factor $\delta$ in the BraVE loss computation. At inference time, we use beam search to improve action selection robustness. Instead of committing to a single greedy path, the algorithm retains the top-W actions (ranked by predicted values) at each tree level.
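
The beam-search inference step quoted in the Experiment Setup row can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes, purely for illustration, that each tree level fixes one sub-action and that a caller-supplied `score` function stands in for the learned value network. The names `beam_search`, `score`, `choices`, and `TOY_VALUES` are hypothetical and do not come from the paper or its repository.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Action = Tuple[int, ...]  # a (partial) action: one entry per decided sub-action


def beam_search(
    num_levels: int,
    choices: Sequence[int],
    score: Callable[[Action], float],
    beam_width: int,
) -> Action:
    """Level-wise beam search over a tree of partial actions.

    Assumption: every level appends one sub-action drawn from `choices`,
    and `score` returns a predicted value for any partial or full action.
    """
    beam: List[Action] = [()]  # start from the empty (root) action
    for _ in range(num_levels):
        # Expand every retained prefix by every possible next sub-action.
        candidates = [prefix + (c,) for prefix in beam for c in choices]
        # Keep only the top-W candidates, ranked by predicted value.
        candidates.sort(key=score, reverse=True)
        beam = candidates[:beam_width]
    return beam[0]  # best full action among the retained candidates


# Hypothetical toy values, chosen so that the greedy path misses the best leaf.
TOY_VALUES: Dict[Action, float] = {
    (0,): 1.0, (1,): 0.9,
    (0, 0): 1.0, (0, 1): 1.1,
    (1, 0): 0.5, (1, 1): 2.0,
}


def toy_score(action: Action) -> float:
    return TOY_VALUES[action]


greedy_action = beam_search(2, (0, 1), toy_score, beam_width=1)
beam_action = beam_search(2, (0, 1), toy_score, beam_width=2)
print(greedy_action, beam_action)
```

With `beam_width=1` the procedure reduces to a single greedy traversal, which in this toy example commits to the prefix `(0,)` and returns `(0, 1)`. With `beam_width=2` the lower-ranked prefix `(1,)` is retained, so the search reaches the higher-valued leaf `(1, 1)`.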