Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Normalizing Flows are Capable Models for Continuous Control

Authors: Raj Ghugare, Benjamin Eysenbach

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments (Section 6) provide compelling evidence that this is not the case. We hope that our work brings greater attention towards the potential of NFs in RL, and the development of scalable probabilistic models with similar desirable properties (Fig. 1).
Researcher Affiliation | Academia | Raj Ghugare Benjamin Eysenbach Department of Computer Science Princeton University EMAIL
Pseudocode | No | The paper includes mathematical equations (Eq. 1-6) and a visualization of an 'NF block' in Figure 7, but no explicit pseudocode or algorithm blocks are provided.
Open Source Code | Yes | The code for all experiments can be found here: https://github.com/Princeton-RL/normalising-flows-4reinforcement-learning.
Open Datasets | Yes | Tasks. We use a total of 45 tasks from OGBench [67] meant to test diverse capabilities like learning from suboptimal trajectories with high dimensional actions, trajectory stitching and long-horizon reasoning. Tasks. We use a total of 30 tasks from prior work [68] testing offline RL with expressive policies. Each task has a goal which requires completing K subtasks. At each step, the agent gets a reward equal to the negative of the number of subtasks left to be completed. For our experiments, we choose tasks that cover a diverse set of challenges and robots. Tasks like humanoidmaze-medium-navigate and antmaze-medium-navigate contain diverse suboptimal trajectories and high dimensional actions. Antsoccer-arena-navigate requires controlling a quadruped agent to dribble a ball to a goal location. Scene-play requires manipulating multiple objects and puzzle-3x3-play requires combinatorial generalization and long-horizon reasoning. We hypothesize that the Q function for the some of these tasks (esp manipulation tasks like puzzle and scene which require long horizon reasoning) can have narrow modes of good actions and a large number of bad actions. Hence using direct gradient based optimization can be crucial to search for good actions. Baselines. We choose 5 representative baselines. BC performs BC with a gaussian policy, ReBRAC [84] is an offline RL algorithm with a gaussian policy that achieves impressive results on prior benchmarks [27].
Dataset Splits | No | The paper mentions running experiments with '5 seeds each' but does not specify train/test/validation splits, proportions, or sample counts for the datasets used. It refers to tasks from OGBench and D4RL without detailing how data was partitioned within those tasks for the experiments.
Hardware Specification | Yes | All experiments were done on RTX 3090s, A5000s or A6000s and did not require more than 20 hours.
Software Dependencies | No | We have also provided the code for our architecture both in Jax [10] and PyTorch [69]. The paper mentions these frameworks but does not specify their version numbers or any other software dependencies with version details.
Experiment Setup | Yes | Appendix A contains the implementation details of all the algorithms, including the hyperparameters used and their values. For example, for NF-BC, 'we use a policy with 12 NF blocks. The fully connected networks of the coupling network for each NF block consist of two hidden layers with 512 activations each.' Table 2 further details hyperparameters like 'channels 512', 'noise std 0.1', 'blocks 12', etc.
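
To make the quoted setup concrete, the following is a minimal, unofficial sketch of a stack of coupling blocks in plain Python. It is not the authors' implementation (their Jax and PyTorch code is in the linked repository). Only the three default values in `NFConfig` come from the quoted text; everything else is an assumption made for illustration: the RealNVP-style affine coupling form, the ReLU activations, the tanh-bounded log-scale, the weight initialisation, the reversal permutation between blocks, and the absence of state conditioning. All function and class names are hypothetical.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class NFConfig:
    blocks: int = 12        # "blocks 12" (quoted from Table 2)
    channels: int = 512     # "channels 512": hidden width of each coupling MLP
    noise_std: float = 0.1  # "noise std 0.1"; its exact use is not sketched here


def init_mlp(rng, sizes):
    """Random weights for a fully connected net (init scale is an assumption)."""
    return [
        ([[rng.gauss(0.0, 0.05) for _ in range(n_in)] for _ in range(n_out)],
         [0.0] * n_out)
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def mlp(params, x):
    """Apply the MLP; ReLU on hidden layers only (activation is an assumption)."""
    for i, (w, b) in enumerate(params):
        x = [sum(wij * xj for wij, xj in zip(row, x)) + bi for row, bi in zip(w, b)]
        if i < len(params) - 1:
            x = [max(0.0, v) for v in x]
    return x


def _scale_shift(params, x1, n):
    out = mlp(params, x1)
    log_s = [math.tanh(v) for v in out[:n]]  # bounded log-scale (assumption)
    return log_s, out[n:]


def coupling_forward(params, x):
    """Affine coupling: keep the first half, transform the second half."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    log_s, t = _scale_shift(params, x1, len(x2))
    y2 = [a * math.exp(s) + b for a, s, b in zip(x2, log_s, t)]
    return x1 + y2, sum(log_s)  # (output, log|det Jacobian|)


def coupling_inverse(params, y):
    """Exact inverse of coupling_forward."""
    half = len(y) // 2
    y1, y2 = y[:half], y[half:]
    log_s, t = _scale_shift(params, y1, len(y2))
    return y1 + [(a - b) * math.exp(-s) for a, s, b in zip(y2, log_s, t)]


def init_flow(rng, dim, cfg):
    """One coupling MLP per block, each with two hidden layers of cfg.channels."""
    assert dim % 2 == 0, "this sketch assumes an even input dimension"
    half = dim // 2
    sizes = [half, cfg.channels, cfg.channels, 2 * half]
    return [init_mlp(rng, sizes) for _ in range(cfg.blocks)]


def flow_forward(flow, x):
    total = 0.0
    for params in flow:
        x, logdet = coupling_forward(params, x)
        x = x[::-1]  # reversal permutation so both halves get transformed
        total += logdet
    return x, total


def flow_inverse(flow, z):
    for params in reversed(flow):
        z = coupling_inverse(params, z[::-1])
    return z


def log_prob(flow, x):
    """Exact log-density under a standard normal base distribution."""
    z, logdet = flow_forward(flow, x)
    log_base = -0.5 * sum(v * v for v in z) - 0.5 * len(z) * math.log(2 * math.pi)
    return log_base + logdet
```

A small configuration such as `NFConfig(blocks=3, channels=16)` keeps this pure-Python version fast. Calling `flow_inverse(flow, flow_forward(flow, x)[0])` should recover `x` up to floating-point error, which is the property that gives coupling-based flows exact likelihoods and exact sampling.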