Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Adaptive Divergence Regularized Policy Optimization for Fine-tuning Generative Models

Authors: Jiajun Fan, Tong Wei, Chaoran Cheng, Yuxin Chen, Ge Liu

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We implemented ADRPO on SD3 (2B parameters) using prompts from DrawBench [28] testing color attribute binding, compositional reasoning, object counting, spatial relationships, and text rendering, as well as artistic style transfer prompts from RAFT [9]. Our method employed the advantage-based ADRPO loss from Equation (6) with β0 = 1 and Amax = 1, Amin = 1, using CLIP score as rewards [25]. We compared against offline methods like DPO [37], online approaches like ORW-CFM-W2 [14], and substantially larger models including FLUX.1 Dev (12B) [43] and SANA-1.5 (4.8B) [38].
Researcher Affiliation | Academia | Jiajun Fan, Tong Wei, Chaoran Cheng, Yuxin Chen, Ge Liu University of Illinois Urbana-Champaign EMAIL
Pseudocode | Yes | Our algorithm pseudocode in App. C (Algorithm 1) further enhances reproducibility by detailing the implementation of ADRPO for SD3 fine-tuning. We first detail our algorithm pseudocode in Algorithm 1 for fine-tuning flow matching models (we use linear interpolation path as an example).
Open Source Code | No | Our implementation details and algorithm pseudocode in App. C and App. B provide sufficient information for reproduction, and we plan to release our codes soon after publication.
Open Datasets | Yes | We use publicly available models (SD3 [13], Qwen2 [41], Qwen3 [42], Qwen2.5-Omni [40]) and datasets (DrawBench [28], RLHFlow [9], AVQA [44], MMAU [29]) as noted in App. B.
Dataset Splits | Yes | For the large language model fine-tuning, we have used the RLHFlow/test_generation_2k dataset [10], containing 2,000 diverse prompts compiled from high-quality instruction-following datasets, and we randomly choose 10% as test prompts.
Hardware Specification | Yes | All experiments were conducted on NVIDIA A6000 (48GB) GPUs.
Software Dependencies | No | The paper mentions software components and techniques like LoRA [18] for parameter-efficient adaptation but does not explicitly state specific version numbers for programming languages, libraries, or frameworks (e.g., Python 3.8, PyTorch 1.9).
Experiment Setup | Yes | Finetuning LLMs. We fine-tuned Qwen2 [41] and Qwen3 [42] using RM-Gemma-2B [27, 9] as the reward model on RLHFlow/test_generation_2k dataset [9]. ADRPO was integrated with GRPO using KL-divergence regularization (Equation (7)) with β0 = 0.04, Amin = 0.04, Amax = 0.04, compared against standard GRPO with fixed β = 0.04 [32]. Fine-tuning Multi-Modal Reasoning Models. We fine-tuned Qwen2.5-Omni-7B [40] on the AVQA dataset [44] using verifiable and format rewards, evaluated on the MMAU benchmark [29]. We used β0 = 0.04, Amin = 0.04, Amax = 0.04, comparing against GRPO baseline and commercial models including Gemini 2.5 Pro [5] and GPT-4o Audio [19].
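
The Dataset Splits entry describes a random 10% hold-out from 2,000 prompts. A minimal Python sketch of such a split is given here for illustration only. The helper name split_prompts, the placeholder prompt strings, and the seed value are all assumptions; the source states neither a seed nor the exact sampling procedure the authors used.

```python
import random


def split_prompts(prompts, test_fraction=0.10, seed=0):
    """Randomly hold out a fraction of prompts as a test set.

    Hypothetical helper: the seed and shuffling scheme are assumptions,
    not details reported in the source.
    """
    rng = random.Random(seed)  # local RNG so the split is repeatable
    indices = list(range(len(prompts)))
    rng.shuffle(indices)

    n_test = round(len(prompts) * test_fraction)
    test_idx = set(indices[:n_test])

    # Preserve the original ordering within each partition.
    train = [p for i, p in enumerate(prompts) if i not in test_idx]
    test = [p for i, p in enumerate(prompts) if i in test_idx]
    return train, test


# Placeholder strings standing in for the 2,000 dataset prompts.
prompts = [f"prompt-{i}" for i in range(2000)]
train_prompts, test_prompts = split_prompts(prompts)
print(len(train_prompts), len(test_prompts))  # 1800 200
```

Under these assumptions, a 10% hold-out of 2,000 prompts leaves 1,800 prompts for fine-tuning and 200 for testing. Fixing the seed makes the partition repeatable across runs, which is the kind of detail a reproduction would need to record.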