Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Ranking-based Preference Optimization for Diffusion Models from Implicit User Feedback

Authors: Yi-Lun Wu, Bo-Kai Ruan, Chiang Tseng, Hong-Han Shuai

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Comprehensive experiments show that Diffusion-DRO delivers improved generation quality across a range of challenging and unseen prompts, outperforming state-of-the-art baselines in both quantitative metrics and user studies. Our source code and pre-trained models are available at https://github.com/basiclab/Diffusion DRO.
Researcher Affiliation Academia Yi-Lun Wu Bo-Kai Ruan Chiang Tseng Hong-Han Shuai Institute of Electrical and Computer Engineering, National Yang Ming Chiao Tung University EMAIL
Pseudocode Yes Algorithm 1 Diffusion Denoising Ranking Optimization
Open Source Code Yes Our source code and pre-trained models are available at https://github.com/basiclab/Diffusion DRO.
Open Datasets Yes Following prior works [17, 35, 18], we use the train split of Pick-a-Pic v2 [14] (MIT license) as our training dataset. For evaluation, we adopt the test split of Pick-a-Pic v2 and the HPDv2 benchmark [36] (Apache-2.0 license), representing in-domain and out-of-domain scenarios, respectively.
Dataset Splits Yes We use the train split of Pick-a-Pic v2 [14] (MIT license) as our training dataset. For evaluation, we adopt the test split of Pick-a-Pic v2 and the HPDv2 benchmark [36] (Apache-2.0 license), representing in-domain and out-of-domain scenarios, respectively. ... The first is the Pick-a-Pic v2 test set, which includes 500 unique text prompts collected from users of the deployed web application. The second is the HPDv2 Benchmark, divided into four categories: anime, concept art, painting, and photo. Each category contains 800 text prompts.
Hardware Specification Yes All experiments, including the reproduction of baseline methods with updated SD model weights, were conducted on four NVIDIA RTX 3090 GPUs.
Software Dependencies No The paper mentions the use of 'AdamW optimizer [20]' and 'DPMSolver++ [21]', and fine-tuning 'Stable Diffusion 1.5 (SD v1-5) [31]', but does not specify version numbers for the software libraries or frameworks (e.g., PyTorch, TensorFlow) used for implementation.
Experiment Setup Yes The AdamW optimizer [20] is used with a learning rate of 10^-4 and an effective batch size of 256 (4 samples per GPU, 16 gradient accumulation steps, yielding 4 * 4 * 16 = 256). The training consists of 1,600 optimization steps, resulting in a total of 16 * 1,600 = 25,600 iterations when accounting for gradient accumulation. During training, 20% of prompts are randomly replaced with empty strings, which helps preserve the model’s ability to perform unconditional generation by maintaining a balance between conditional and unconditional sampling. ... The clipping threshold m for the thresholded ranking loss (TRL) is set to 0.001, and the policy model update interval M is set to 1 for all experiments. For sampling x_t from the policy model, we employ DPMSolver++ [21] with 20 steps, without utilizing classifier-free guidance [12].
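
The arithmetic in the Experiment Setup row can be checked with a short Python sketch. It is an illustration only, not the authors' code: the class name TrainConfig, its field names, and the helper drop_prompts are hypothetical. The GPU count of 4 comes from the Hardware Specification row, and the accumulation count of 16 is taken from the quoted product 4 * 4 * 16 = 256.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters quoted in the table; all field names are hypothetical."""
    learning_rate: float = 1e-4
    num_gpus: int = 4                 # from the Hardware Specification row
    samples_per_gpu: int = 4
    grad_accum_steps: int = 16        # from the quoted product 4 * 4 * 16
    optimization_steps: int = 1600
    prompt_dropout: float = 0.2       # share of prompts replaced by ""
    trl_threshold: float = 0.001      # clipping threshold m
    policy_update_interval: int = 1   # update interval M
    sampler_steps: int = 20           # DPMSolver++ steps

    @property
    def effective_batch_size(self) -> int:
        # Samples contributing to one optimizer update.
        return self.samples_per_gpu * self.num_gpus * self.grad_accum_steps

    @property
    def total_iterations(self) -> int:
        # Forward/backward passes per GPU, counting accumulation steps.
        return self.grad_accum_steps * self.optimization_steps


def drop_prompts(prompts, rate, rng):
    """Replace each prompt with an empty string with probability `rate`."""
    return ["" if rng.random() < rate else p for p in prompts]


cfg = TrainConfig()
print(cfg.effective_batch_size)  # 256
print(cfg.total_iterations)      # 25600

rng = random.Random(0)
batch = drop_prompts(["a cat", "a dog", "a boat", "a tree"], cfg.prompt_dropout, rng)
print(batch)
```

With a rate of 0.2, roughly one prompt in five is blanked on average; the exact prompts dropped depend on the random seed.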