Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

ReinFlow: Fine-tuning Flow Matching Policy with Online Reinforcement Learning

Authors: Tonghe Zhang, Chao Yu, Sichang Su, Yu Wang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We benchmark ReinFlow in representative locomotion and manipulation tasks, including long-horizon planning with visual input and sparse reward. The episode reward of Rectified Flow policies obtained an average net growth of 135.36% after fine-tuning... The success rate of the Shortcut Model policies... achieved an average net increase of 40.34% after fine-tuning... We perform extensive experiments in representative robot locomotion and manipulation tasks, with the agent receiving state or pixel observations and possibly accepting sparse rewards.
Researcher Affiliation Academia Tonghe Zhang Robotics Institute Carnegie Mellon University EMAIL Chao Yu Shenzhen International Graduate School Tsinghua University EMAIL Sichang Su Department of Aerospace Engineering The University of Texas at Austin EMAIL Yu Wang Department of Electronic Engineering Tsinghua University EMAIL
Pseudocode Yes Algorithm 1 ReinFlow
1: Input: pre-trained flow matching policy's velocity field $v_\theta$; denoising step number $K$; discount factor $\gamma$; batch size $B$; discretization scheme $\{0 = t_0 < t_1 < \dots < t_K = 1\}$ with $\Delta t_k := t_{k+1} - t_k$; regularization function $R$ with intensity coefficient $\alpha \in \mathbb{R}$.
2: Initialize noise injection network $\sigma_{\theta'}$.
3: while not converged do
4:   Restore last iteration's parameters: $\bar\theta_{\text{old}} \leftarrow \text{stop\_grad}([\theta, \theta'])$
5:   Reset environment and receive initial observation $o$.
6:   while not done do   ▷ Rollout policy $\pi$.
7:     Sample $a^0 \sim \mathcal{N}(0, I_{d_A})$
8:     for denoising step $k$ in $\{0, 1, \dots, K-1\}$ do   ▷ Inject noise and integrate.
9:       $a^{k+1} \leftarrow a^k + v_\theta(t_k, a^k, o)\,\Delta t_k + \sigma_{\theta'}(t_k, a^k, o)\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, I_{d_A})$
10:     end for
11:     Record denoised actions $a^0, a^1, \dots, a^K$ in a buffer.
12:     Play action $a = a^K$, receive reward $r$ and done flag $d$, update observation $o$.
13:     Store $\{\bar a, o, r, d\}$ to buffer, where $\bar a := (a^0, a^1, \dots, a^K)$
14:   end while
15:   Sample a batch of data $\{\bar a_i, o_i, r_i, d_i\}_{i=1}^{B}$ from buffer.   ▷ Optimize policy.
16:   Compute the policy's transition probability for each denoising step $k$ by Eq. (7):
17:   $\ln \pi_{\bar\theta}(a_i^{k+1} \mid a_i^k, o_i) = \ln \mathcal{N}\big(a_i^{k+1} \mid a_i^k + v_\theta(t_k, a_i^k, o_i)\,\Delta t_k,\ \sigma_{\theta'}^2(t_k, a_i^k, o_i)\big)$
18:   Compute the regularization function $R$ evaluated on each tuple, denoted as $R(\bar a_i, o_i; \bar\theta, \bar\theta_{\text{old}})$.
19:   Call a policy gradient sub-routine, such as Alg. 2, to jointly optimize $\theta$ and $\theta'$ by Eq. (9): $\theta, \theta' = \arg\min_{\theta, \theta'} \frac{1}{B} \sum_{i=1}^{B} \Big[ -A_{\theta_{\text{old}}}(o_i, a_i) \sum_{k=0}^{K-1} \ln \pi_{\bar\theta}(a_i^{k+1} \mid a_i^k, o_i) + \alpha\, R(\bar a_i, o_i; \bar\theta, \bar\theta_{\text{old}}) \Big]$, where $\bar\theta := [\theta, \theta']$
20: end while
21: Return fine-tuned flow matching policy's velocity field $v_\theta$.
Algorithm 2 ReinFlow Subroutine for Policy Optimization (PPO implementation)
1: Input: clipping range $\epsilon \in (0, 1)$; policy parameters at the current iteration $\bar\theta := [\theta, \theta']$ and the last iteration $\bar\theta_{\text{old}}$; data $\{\bar a_i, o_i, r_i, d_i\}_{i=1}^{B}$; regularization function values $\{R(\bar a_i, o_i; \bar\theta, \bar\theta_{\text{old}})\}_{i=1}^{B}$, with intensity $\alpha \in \mathbb{R}$.
2: Compute the advantage estimates $\hat A_i := \hat A(o_i, a_i^K)$ by methods such as GAE [46]
3: Jointly optimize the velocity net $\theta$ and noise net $\theta'$ by taking gradient step on the clipped surrogate loss: $-\min\Big( \frac{\pi_{\bar\theta}(\bar a_i \mid o_i)}{\pi_{\bar\theta_{\text{old}}}(\bar a_i \mid o_i)} \hat A_i,\ \text{clip}\Big( \frac{\pi_{\bar\theta}(\bar a_i \mid o_i)}{\pi_{\bar\theta_{\text{old}}}(\bar a_i \mid o_i)},\ 1 - \epsilon,\ 1 + \epsilon \Big) \hat A_i \Big) + \alpha\, R(\bar a_i, o_i; \bar\theta, \bar\theta_{\text{old}})$
4: Return updated parameters $\theta, \theta'$
Open Source Code Yes Code, model, and checkpoints available on the project website: https://reinflow.github.io/
Open Datasets Yes Expert data are medium- or medium-expert-level demonstrations collected from the D4RL dataset [19]. Franka Kitchen [21]. Robomimic [36]. OpenAI Gym [9].
Dataset Splits No The paper does not explicitly provide training/test/validation dataset splits. It mentions using known datasets like D4RL, Franka Kitchen, and Robomimic, and describes the nature of the data (e.g., expert demonstrations, human teleoperated data) but does not detail how these datasets are partitioned into splits for training, validation, or testing in the conventional supervised learning sense. For RL fine-tuning, the 'training data' is often generated through interaction with the environment rather than fixed splits.
Hardware Specification Yes on a single NVIDIA RTX 3090 GPU with EGL rendering. We evaluate the wall time of Robomimic Transport on two NVIDIA A100 GPUs with EGL rendering
Software Dependencies No The paper mentions optimizers like Adam [26] and activation functions like Mish [37], but does not provide specific version numbers for software libraries (e.g., PyTorch, TensorFlow, scikit-learn) or other tools used in its implementation. It mentions 'MuJoCo graphics rendering backend (MUJOCO_GL) is set to Embedded System Graphics Library (EGL)' but without version.
Experiment Setup Yes We list the key hyper-parameters and model architectures needed to reproduce the experiment results of ReinFlow and other baseline algorithms. ... Table 6: ReinFlow's Shared Hyperparameters Across All Tasks. Table 7: ReinFlow's Hyperparameters in OpenAI Gym Locomotion Tasks. Table 8: ReinFlow's Hyperparameters in Franka Kitchen State-input Manipulation Tasks. Table 9: ReinFlow's Hyperparameters in Robomimic Visual Manipulation Tasks. Table 10: Hyperparameters for FQL in Gym and Franka Kitchen Tasks.
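
To make the Pseudocode row more concrete, the following is a minimal sketch of the noise-injected denoising rollout, the per-step Gaussian log-probability, and a PPO-style clipped surrogate, written in plain Python. It is not the authors' code. The functions `velocity` and `noise_std` are hypothetical toy stand-ins for the velocity network and the noise injection network, the uniform time grid and the default `eps=0.2` are assumptions, and the regularization term is omitted for brevity.

```python
import math
import random


def velocity(t, a, o):
    # Hypothetical stand-in for the velocity field: drift each action
    # coordinate toward the mean of the observation.
    target = sum(o) / len(o)
    return [target - x for x in a]


def noise_std(t, a, o):
    # Hypothetical stand-in for the noise injection network: constant std.
    return [0.1 for _ in a]


def gaussian_log_prob(x, mean, std):
    # Log-density of a diagonal Gaussian, summed over action dimensions.
    return sum(
        -0.5 * ((xi - mi) / si) ** 2 - math.log(si) - 0.5 * math.log(2.0 * math.pi)
        for xi, mi, si in zip(x, mean, std)
    )


def noisy_rollout(o, action_dim, K, rng):
    # Sample a^0, ..., a^K with injected noise and record the log-probability
    # of every denoising transition.
    ts = [k / K for k in range(K + 1)]  # uniform grid (an assumption)
    a = [rng.gauss(0.0, 1.0) for _ in range(action_dim)]
    chain, log_probs = [a], []
    for k in range(K):
        dt = ts[k + 1] - ts[k]
        v = velocity(ts[k], a, o)
        std = noise_std(ts[k], a, o)
        mean = [ai + vi * dt for ai, vi in zip(a, v)]
        a = [mi + si * rng.gauss(0.0, 1.0) for mi, si in zip(mean, std)]
        log_probs.append(gaussian_log_prob(a, mean, std))
        chain.append(a)
    return chain, log_probs


def clipped_surrogate(log_prob_new, log_prob_old, advantage, eps=0.2):
    # PPO-style clipped loss for one sample; the ratio is computed from
    # log-probabilities summed over the whole denoising chain.
    ratio = math.exp(log_prob_new - log_prob_old)
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return -min(ratio * advantage, clipped * advantage)


rng = random.Random(0)
chain, log_probs = noisy_rollout(o=[0.5, -0.5], action_dim=3, K=4, rng=rng)
chain_log_prob = sum(log_probs)  # log-probability of the full chain
```

Because each transition is Gaussian once noise is injected, the log-probability of the whole chain is a sum of per-step terms, which is what lets a standard policy gradient routine be applied to a flow policy in this sketch.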