Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Flow-Based Policy for Online Reinforcement Learning

Authors: Lei Lyu, Yunfei Li, Yu Luo, Fuchun Sun, Tao Kong, Jiafeng Xu, Xiao Ma

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirical evaluations on DMControl and HumanoidBench demonstrate that Flow RL achieves competitive performance in online reinforcement learning benchmarks. We have released our code here. We evaluate our approach on challenging DMControl [40] and HumanoidBench [36], demonstrating competitive performance against state-of-the-art baselines.
Researcher Affiliation	Collaboration	Lei Lv (1,2,3), Yunfei Li (2,3), Yu Luo (3), Fuchun Sun (3), Tao Kong (2), Jiafeng Xu (2), Xiao Ma (2). Affiliations: (1) Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University; (2) ByteDance Seed; (3) Tsinghua University.
Pseudocode	Yes	Algorithm 1 Flow RL
Open Source Code	Yes	We have released our code here.
Open Datasets	Yes	Empirical evaluations on DMControl and HumanoidBench demonstrate that Flow RL achieves competitive performance in online reinforcement learning benchmarks. We have released our code here. We evaluate our approach on challenging DMControl [40] and HumanoidBench [36], demonstrating competitive performance against state-of-the-art baselines.
Dataset Splits	No	All model-free algorithms (Flow RL, SAC, QVPO, TD3) are evaluated with 5 random seeds, while the model-based algorithm (TD-MPC2) uses 3 seeds. The paper discusses the 'online off-policy RL setting, where the agent interacts with the environment and collects new data into a replay buffer D ← D ∪ {(s, a, s', r)}.' This indicates dynamic data generation rather than static dataset splits.
Hardware Specification	Yes	All experiments are conducted on a single NVIDIA H100 GPU and an Intel(R) Platinum 8480C CPU, with two tasks running in parallel on the GPU.
Software Dependencies	No	For SAC [13], we utilized the open-source PyTorch implementation, available at https://github.com/pranz24/pytorch-soft-actor-critic. (No PyTorch version is specified.) Optimizer: Adam (no version specified).
Experiment Setup	Yes	The hyperparameters used in our experiments are summarized in Table 1. For the choice of the weighting function, we use f(x) = I(x) exp(x), where I(x) is the indicator function, i.e., I(x) = 1 if x > 0, and 0 otherwise. For numerical stability, the Q function is normalized by subtracting its mean exclusively during the computation of the weighting function. Table 1: Hyperparameters. Optimizer: Adam; Critic learning rate: 3 × 10^-4; Actor learning rate: 3 × 10^-4; Discount factor: 0.99; Batch size: 256; Replay buffer size: 1 × 10^6; Expectile factor τ: 0.9; Lagrangian multiplier λ: 0.1; Flow steps N: 1; ODE solver: Midpoint Euler. Value network: hidden dim 512, hidden layers 3, activation function mish. Policy network: hidden dim 512, hidden layers 2, activation function elu.
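
The Experiment Setup row describes a weighting function and a one-step ODE solver. The sketch below illustrates both in plain Python and is not the authors' released code. It assumes that the argument x of f is the mean-centred Q-value of each sample in a batch, and that "Midpoint Euler" refers to the explicit midpoint method integrated over t in [0, 1]. The names `HYPERPARAMS`, `advantage_weights`, `midpoint_flow`, and the `velocity` callback are hypothetical and chosen only for illustration.

```python
import math
from typing import Callable, Dict, List, Sequence

# Values transcribed from Table 1 as quoted above; the key names are illustrative.
HYPERPARAMS: Dict[str, float] = {
    "critic_lr": 3e-4,
    "actor_lr": 3e-4,
    "discount": 0.99,
    "batch_size": 256,
    "replay_buffer_size": 1e6,
    "expectile_tau": 0.9,
    "lagrangian_lambda": 0.1,
    "flow_steps": 1,
}


def advantage_weights(q_values: Sequence[float]) -> List[float]:
    """Apply f(x) = I(x) * exp(x), with I(x) = 1 if x > 0 and 0 otherwise.

    Assumption: x is the Q-value minus the batch mean, following the
    statement that Q is normalized by subtracting its mean only when
    the weighting function is computed.
    """
    mean_q = sum(q_values) / len(q_values)
    weights = []
    for q in q_values:
        x = q - mean_q
        weights.append(math.exp(x) if x > 0 else 0.0)
    return weights


def midpoint_flow(
    velocity: Callable[[List[float], float], List[float]],
    a0: Sequence[float],
    steps: int = 1,
) -> List[float]:
    """Integrate da/dt = velocity(a, t) from t = 0 to t = 1.

    Uses the explicit midpoint method with `steps` equal-sized steps.
    Assumption: this is what the table calls "Midpoint Euler".
    """
    h = 1.0 / steps
    a = list(a0)
    for k in range(steps):
        t = k * h
        v_start = velocity(a, t)
        # Half Euler step to estimate the state at the interval midpoint.
        mid = [ai + 0.5 * h * vi for ai, vi in zip(a, v_start)]
        v_mid = velocity(mid, t + 0.5 * h)
        # Full step using the midpoint velocity.
        a = [ai + h * vi for ai, vi in zip(a, v_mid)]
    return a


if __name__ == "__main__":
    print(advantage_weights([1.0, 2.0, 3.0]))
    print(midpoint_flow(lambda a, t: [t for _ in a], [0.0], steps=1))
```

With N = 1 flow steps, as listed in the table, `midpoint_flow` evaluates the velocity function twice per action sample: once at the start and once at the midpoint.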