Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Value Gradient Guidance for Flow Matching Alignment

Authors: Zhen Liu, Tim Xiao, Carles Domingo i Enrich, Weiyang Liu, Dinghuai Zhang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirically, we show on a popular text-to-image flow matching model, Stable Diffusion 3, that our method can finetune flow matching models under limited computational budgets while achieving effective and prior-preserving alignment.
Researcher Affiliation	Collaboration	Zhen Liu¹, Tim Z. Xiao²*, Carles Domingo-Enrich³*, Weiyang Liu⁴, Dinghuai Zhang³,⁵; ¹The Chinese University of Hong Kong (Shenzhen), ²University of Tübingen, ³Microsoft Research, ⁴The Chinese University of Hong Kong, ⁵Mila Quebec AI Institute
Pseudocode	Yes	Algorithm 1: VGG-Flow algorithm. Require: Pretrained flow matching model v_base(x, t), given reward function r(x_1), value gradient model g_ϕ(x, t) parameterized by Equation 16. Ensure: Finetuned flow matching model v_θ(x, t). Initialize flow matching model v_θ ← v_base. while stopping criterion not met do: Collect trajectories {x_t}_t via solving the current neural ODE ẋ_t = v_θ(x_t, t). Update value gradient model g_ϕ(x, t) with loss L_consistency(ϕ) + α L_boundary(ϕ). Update velocity field model v_θ(x, t) with loss L_matching(θ). end while
Open Source Code	Yes	Corresponding author vgg-flow.github.io Justification: We release the whole set of code that can reproduce our algorithm.
Open Datasets	Yes	Prompt dataset. For Aesthetic Score, we use a set of simple animal prompts used in the original DDPO paper [6]; for HPSv2, we consider photo+painting prompts from the human preference dataset (HPDv2) [67]; for Pick Score, we use the prompt set in the Pick-a-Pic dataset [32].
Dataset Splits	No	The paper describes using specific prompt datasets (Aesthetic Score, HPSv2, Pick Score) to generate images for evaluation and finetuning reward models, but it does not specify traditional training/validation/test splits for these datasets within the context of their own model development or evaluation. The prompt sets are used as inputs for generation, not as partitioned datasets in the typical sense.
Hardware Specification	No	The paper mentions using "4 GPUs for each run" for adjoint matching experiments and "bfloat16 computation for the flow matching model", which implies GPU usage. However, it does not specify the exact GPU models (e.g., NVIDIA A100, RTX 3090) or any other specific hardware details like CPU, memory, or cloud instance types.
Software Dependencies	No	The paper mentions using "PyTorch" for finite difference methods and Jacobian-vector products, and the "AdamW optimizer", but it does not provide specific version numbers for these software libraries or any other key dependencies.
Experiment Setup	Yes	Base model. Throughout the paper, we consider the popular open-sourced text-conditioned flow matching model Stable Diffusion 3 [17] and a 20-step Euler solver to sample trajectories. Experiment settings and implementation details. We use LoRA parametrization [31] on attention layers of the finetuned flow matching model with a LoRA rank of 8. The value gradient network in VGG-Flow is set to be a scaled-down version of the Stable Diffusion-v1.5 U-Net, initialized with tiny weights in the final output layers. ... For all experiments, we use 3 random seeds. For Aesthetic Score, HPSv2 and Pick Score experiments, we set the default inverse temperature terms β = 1/λ to 5e4, 3e7 and 5e5, respectively; for all ablation studies with Aesthetic Score, we set β = 1e4. We set the boundary loss coefficient α to 10000 for all experiments. We use an effective batch size of 32 for all methods, and only use on-policy samples without any replay buffer. For VGG-Flow, we sub-sample the collected trajectories by uniformly splitting each into 5 bins and then taking one transition out of each; we also clip the computed reward gradients in Eqn. 16 at the 80th percentile of the gradient norms of the corresponding training batches. For ReFL, we sample the truncation time step between 15 and 20. We follow prior work [51, 70] and use ReLU(r(x)) as the reward model for both ReFL and DRaFT for stable training. ... We use the best learning rates ... Specifically, we use 5e-4 for VGG-Flow on all reward models, 5e-5 for VGG-Flow-PMP on HPSv2 and Pick Score, and 1e-4 for all others. We use the standard AdamW optimizer with β1 = 0.9, β2 = 0.999 and weight decay 1e-2. We clip the norm of network update gradients to 1. We use bfloat16 computation for the flow matching model but float32 for the reward model due to numerical precision issues.
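
Illustrative sketch (not from the paper or its released code): the Experiment Setup entry above describes two preprocessing steps for VGG-Flow, namely splitting each 20-step trajectory into 5 uniform bins and keeping one transition per bin, and clipping reward gradients at the 80th percentile of the batch's gradient norms. The Python below shows one plausible reading of those two steps. The function names are hypothetical, and three details are assumptions: the transition within each bin is chosen uniformly at random, "clipping" means rescaling a gradient so its L2 norm equals the threshold, and the percentile uses the nearest-rank definition.

```python
import math
import random


def subsample_transitions(num_steps, num_bins=5, rng=None):
    """Return one transition index from each of `num_bins` equal-width bins.

    Assumption: the index inside each bin is drawn uniformly at random.
    """
    if num_steps % num_bins != 0:
        raise ValueError("num_steps must be divisible by num_bins")
    rng = rng or random.Random()
    width = num_steps // num_bins
    return [rng.randrange(b * width, (b + 1) * width) for b in range(num_bins)]


def clip_by_norm_percentile(grads, percentile=80.0):
    """Rescale gradients whose L2 norm exceeds the batch percentile norm.

    Assumption: nearest-rank percentile over the per-sample norms, and
    clipping by rescaling (direction preserved, norm capped).
    """
    norms = [math.sqrt(sum(g * g for g in grad)) for grad in grads]
    ordered = sorted(norms)
    rank = max(1, math.ceil(percentile / 100.0 * len(ordered)))
    threshold = ordered[rank - 1]
    clipped = []
    for grad, norm in zip(grads, norms):
        scale = 1.0 if norm <= threshold else threshold / norm
        clipped.append([g * scale for g in grad])
    return clipped, threshold


if __name__ == "__main__":
    # 20 Euler steps split into 5 bins of 4 transitions each.
    print(subsample_transitions(20, num_bins=5, rng=random.Random(0)))
    toy_grads = [[3.0, 4.0], [0.6, 0.8], [6.0, 8.0], [0.0, 1.0], [0.0, 2.0]]
    print(clip_by_norm_percentile(toy_grads, percentile=80.0))
```

With a 20-step solver and 5 bins, each bin covers 4 consecutive transitions, so every training batch sees one transition from each fifth of the trajectory. The percentile threshold adapts to each batch, unlike a fixed clipping constant.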