Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Ranking-based Preference Optimization for Diffusion Models from Implicit User Feedback
Authors: Yi-Lun Wu, Bo-Kai Ruan, Chiang Tseng, Hong-Han Shuai
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experiments show that Diffusion-DRO delivers improved generation quality across a range of challenging and unseen prompts, outperforming state-of-the-art baselines in both quantitative metrics and user studies. Our source code and pre-trained models are available at https://github.com/basiclab/Diffusion DRO. |
| Researcher Affiliation | Academia | Yi-Lun Wu Bo-Kai Ruan Chiang Tseng Hong-Han Shuai Institute of Electrical and Computer Engineering, National Yang Ming Chiao Tung University EMAIL |
| Pseudocode | Yes | Algorithm 1 Diffusion Denoising Ranking Optimization |
| Open Source Code | Yes | Our source code and pre-trained models are available at https://github.com/basiclab/Diffusion DRO. |
| Open Datasets | Yes | Following prior works [17, 35, 18], we use the train split of Pick-a-Pic v2 [14] (MIT license) as our training dataset. For evaluation, we adopt the test split of Pick-a-Pic v2 and the HPDv2 benchmark [36] (Apache-2.0 license), representing in-domain and out-of-domain scenarios, respectively. |
| Dataset Splits | Yes | We use the train split of Pick-a-Pic v2 [14] (MIT license) as our training dataset. For evaluation, we adopt the test split of Pick-a-Pic v2 and the HPDv2 benchmark [36] (Apache-2.0 license), representing in-domain and out-of-domain scenarios, respectively. ... The first is the Pick-a-Pic v2 test set, which includes 500 unique text prompts collected from users of the deployed web application. The second is the HPDv2 Benchmark, divided into four categories: anime, concept art, painting, and photo. Each category contains 800 text prompts. |
| Hardware Specification | Yes | All experiments, including the reproduction of baseline methods with updated SD model weights, were conducted on four NVIDIA RTX 3090 GPUs. |
| Software Dependencies | No | The paper mentions the use of 'AdamW optimizer [20]' and 'DPMSolver++ [21]', and fine-tuning 'Stable Diffusion 1.5 (SD v1-5) [31]', but does not specify version numbers for the software libraries or frameworks (e.g., PyTorch, TensorFlow) used for implementation. |
| Experiment Setup | Yes | The AdamW optimizer [20] is used with a learning rate of 10⁻⁴ and an effective batch size of 256 (4 GPUs, 4 samples per GPU, 16 gradient accumulation steps, yielding 4 * 4 * 16 = 256). The training consists of 1,600 optimization steps, resulting in a total of 16 * 1,600 = 25,600 iterations when accounting for gradient accumulation. During training, 20% of prompts are randomly replaced with empty strings, which helps preserve the model's ability to perform unconditional generation by maintaining a balance between conditional and unconditional sampling. ... The clipping threshold m for the thresholded ranking loss (TRL) is set to 0.001, and the policy model update interval M is set to 1 for all experiments. For sampling xt from the policy model, we employ DPMSolver++ [21] with 20 steps, without utilizing classifier-free guidance [12]. |
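
The batch-size arithmetic and the prompt-dropout step quoted in the Experiment Setup row can be checked with a short Python sketch. This is not the authors' code: the function names `effective_batch_size`, `total_iterations`, and `drop_prompts` are hypothetical, and the sketch assumes 16 gradient accumulation steps, the value consistent with both products quoted in the row (4 * 4 * 16 = 256 and 16 * 1,600 = 25,600).

```python
import random

# Values quoted in the Experiment Setup row. GRAD_ACCUM_STEPS = 16 is an
# assumption: it is the value consistent with both quoted products.
NUM_GPUS = 4
SAMPLES_PER_GPU = 4
GRAD_ACCUM_STEPS = 16
OPTIMIZATION_STEPS = 1_600
PROMPT_DROP_PROB = 0.2


def effective_batch_size(num_gpus, samples_per_gpu, grad_accum_steps):
    """Samples contributing to one optimizer update."""
    return num_gpus * samples_per_gpu * grad_accum_steps


def total_iterations(optimization_steps, grad_accum_steps):
    """Forward/backward passes per GPU, counting accumulation steps."""
    return optimization_steps * grad_accum_steps


def drop_prompts(prompts, drop_prob, rng):
    """Replace each prompt with an empty string with probability drop_prob.

    Hypothetical helper illustrating the 20% prompt replacement described
    in the row; the authors' implementation may differ.
    """
    return ["" if rng.random() < drop_prob else p for p in prompts]


if __name__ == "__main__":
    print(effective_batch_size(NUM_GPUS, SAMPLES_PER_GPU, GRAD_ACCUM_STEPS))
    print(total_iterations(OPTIMIZATION_STEPS, GRAD_ACCUM_STEPS))
    rng = random.Random(0)  # seeded only to make the demo repeatable
    print(drop_prompts(["a cat", "a dog", "a tree"], PROMPT_DROP_PROB, rng))
```

Passing an explicit `random.Random` instance keeps the dropout step reproducible in tests; a real training loop would more likely draw from the framework's own seeded generator.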