Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Flow-Based Policy for Online Reinforcement Learning
Authors: Lei Lyu, Yunfei Li, Yu Luo, Fuchun Sun, Tao Kong, Jiafeng Xu, Xiao Ma
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluations on DMControl and HumanoidBench demonstrate that Flow RL achieves competitive performance in online reinforcement learning benchmarks. We have released our code here. We evaluate our approach on challenging DMControl [40] and HumanoidBench [36], demonstrating competitive performance against state-of-the-art baselines. |
| Researcher Affiliation | Collaboration | Lei Lv 1,2,3, Yunfei Li 2,3, Yu Luo 3, Fuchun Sun 3, Tao Kong 2, Jiafeng Xu 2, Xiao Ma 2; 1 Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University; 2 ByteDance Seed; 3 Tsinghua University; EMAIL; EMAIL; EMAIL |
| Pseudocode | Yes | Algorithm 1 Flow RL |
| Open Source Code | Yes | We have released our code here. |
| Open Datasets | Yes | Empirical evaluations on DMControl and HumanoidBench demonstrate that Flow RL achieves competitive performance in online reinforcement learning benchmarks. We have released our code here. We evaluate our approach on challenging DMControl [40] and HumanoidBench [36], demonstrating competitive performance against state-of-the-art baselines. |
| Dataset Splits | No | All model-free algorithms (Flow RL, SAC, QVPO, TD3) are evaluated with 5 random seeds, while the model-based algorithm (TD-MPC2) uses 3 seeds. The paper discusses the 'online off-policy RL setting, where the agent interacts with the environment and collects new data into a replay buffer $\mathcal{D} \leftarrow \mathcal{D} \cup \{(s, a, s', r)\}$.' This indicates dynamic data generation rather than static dataset splits. |
| Hardware Specification | Yes | All experiments are conducted on a single NVIDIA H100 GPU and an Intel(R) Platinum 8480C CPU, with two tasks running in parallel on the GPU. |
| Software Dependencies | No | For SAC [13], we utilized the open-source PyTorch implementation, available at https://github.com/pranz24/pytorch-soft-actor-critic. (Does not specify the PyTorch version.) Optimizer: Adam (no version given). |
| Experiment Setup | Yes | The hyperparameters used in our experiments are summarized in Table 1. For the choice of the weighting function, we use $f(x) = I(x)\exp(x)$, where $I(x)$ is the indicator function, i.e., $I(x) = 1$ if $x > 0$ and $I(x) = 0$ otherwise. For numerical stability, the Q function is normalized by subtracting its mean exclusively during the computation of the weighting function. Table 1: Hyperparameters (hyperparameter: value). Optimizer: Adam; Critic learning rate: $3 \times 10^{-4}$; Actor learning rate: $3 \times 10^{-4}$; Discount factor: 0.99; Batch size: 256; Replay buffer size: $1 \times 10^{6}$; Expectile factor $\tau$: 0.9; Lagrangian multiplier $\lambda$: 0.1; Flow steps $N$: 1; ODE solver: Midpoint Euler. Value network: hidden dim 512, hidden layers 3, activation function mish. Policy network: hidden dim 512, hidden layers 2, activation function elu. |
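
The weighting function quoted in the Experiment Setup row can be illustrated with a minimal sketch. It is not taken from the authors' released code. The helper name `centered_weights` is hypothetical, and applying the function directly to batch-mean-centered values is an assumption made only to show the "subtract the mean" step; the excerpt above does not state what the argument $x$ is.

```python
import math


def weight(x: float) -> float:
    """f(x) = I(x) * exp(x), where I(x) = 1 if x > 0 and 0 otherwise."""
    return math.exp(x) if x > 0 else 0.0


def centered_weights(values):
    """Hypothetical helper (assumption): subtract the batch mean, then apply f."""
    mean = sum(values) / len(values)
    return [weight(v - mean) for v in values]


# Toy values with mean 2.0, so the centered inputs are -1.0, 0.0 and 1.0.
weights = centered_weights([1.0, 2.0, 3.0])
```

Because the indicator uses a strict inequality, an input exactly at the mean receives weight 0, and only above-mean values receive an exponential weight.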