Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Ranking-based Preference Optimization for Diffusion Models from Implicit User Feedback

Authors: Yi-Lun Wu, Bo-Kai Ruan, Chiang Tseng, Hong-Han Shuai

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Comprehensive experiments show that Diffusion-DRO delivers improved generation quality across a range of challenging and unseen prompts, outperforming state-of-the-art baselines in both quantitative metrics and user studies. Our source code and pre-trained models are available at https://github.com/basiclab/Diffusion DRO.
Researcher Affiliation	Academia	Yi-Lun Wu Bo-Kai Ruan Chiang Tseng Hong-Han Shuai Institute of Electrical and Computer Engineering, National Yang Ming Chiao Tung University EMAIL
Pseudocode	Yes	Algorithm 1 Diffusion Denoising Ranking Optimization
Open Source Code	Yes	Our source code and pre-trained models are available at https://github.com/basiclab/Diffusion DRO.
Open Datasets	Yes	Following prior works [17, 35, 18], we use the train split of Pick-a-Pic v2 [14] (MIT license) as our training dataset. For evaluation, we adopt the test split of Pick-a-Pic v2 and the HPDv2 benchmark [36] (Apache-2.0 license), representing in-domain and out-of-domain scenarios, respectively.
Dataset Splits	Yes	We use the train split of Pick-a-Pic v2 [14] (MIT license) as our training dataset. For evaluation, we adopt the test split of Pick-a-Pic v2 and the HPDv2 benchmark [36] (Apache-2.0 license), representing in-domain and out-of-domain scenarios, respectively. ... The first is the Pick-a-Pic v2 test set, which includes 500 unique text prompts collected from users of the deployed web application. The second is the HPDv2 Benchmark, divided into four categories: anime, concept art, painting, and photo. Each category contains 800 text prompts.
Hardware Specification	Yes	All experiments, including the reproduction of baseline methods with updated SD model weights, were conducted on four NVIDIA RTX 3090 GPUs.
Software Dependencies	No	The paper mentions the use of 'AdamW optimizer [20]' and 'DPMSolver++ [21]', and fine-tuning 'Stable Diffusion 1.5 (SD v1-5) [31]', but does not specify version numbers for the software libraries or frameworks (e.g., PyTorch, TensorFlow) used for implementation.
Experiment Setup	Yes	The AdamW optimizer [20] is used with a learning rate of 10^-4 and an effective batch size of 256 (4 GPUs, 4 samples per GPU, 16 gradient accumulation steps, yielding 4 * 4 * 16 = 256). The training consists of 1,600 optimization steps, resulting in a total of 16 * 1,600 = 25,600 iterations when accounting for gradient accumulation. During training, 20% of prompts are randomly replaced with empty strings, which helps preserve the model’s ability to perform unconditional generation by maintaining a balance between conditional and unconditional sampling. ... The clipping threshold m for the thresholded ranking loss (TRL) is set to 0.001, and the policy model update interval M is set to 1 for all experiments. For sampling x_t from the policy model, we employ DPMSolver++ [21] with 20 steps, without utilizing classifier-free guidance [12].
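
The arithmetic in the Experiment Setup row can be checked with a short sketch. The Python below is illustrative only and is not taken from the authors' repository: the function names are hypothetical, and the accumulation factor of 16 is assumed because it is the value used in both quoted products (4 * 4 * 16 = 256 and 16 * 1,600 = 25,600). The prompt-dropping helper shows one plausible way to replace roughly 20% of prompts with empty strings; the paper's actual implementation may differ.

```python
import random

# Values quoted in the Experiment Setup row.
NUM_GPUS = 4
SAMPLES_PER_GPU = 4
GRAD_ACCUM_STEPS = 16    # assumption: the factor used in the quoted products
OPTIMIZER_STEPS = 1_600
PROMPT_DROP_PROB = 0.2   # 20% of prompts replaced with empty strings


def effective_batch_size(num_gpus, samples_per_gpu, grad_accum_steps):
    """Samples contributing to one optimizer update."""
    return num_gpus * samples_per_gpu * grad_accum_steps


def total_iterations(optimizer_steps, grad_accum_steps):
    """Forward/backward passes per GPU over the whole run."""
    return optimizer_steps * grad_accum_steps


def drop_prompts(prompts, drop_prob, rng):
    """Replace each prompt with "" independently with probability drop_prob."""
    return ["" if rng.random() < drop_prob else p for p in prompts]


if __name__ == "__main__":
    print(effective_batch_size(NUM_GPUS, SAMPLES_PER_GPU, GRAD_ACCUM_STEPS))
    print(total_iterations(OPTIMIZER_STEPS, GRAD_ACCUM_STEPS))
    rng = random.Random(0)  # fixed seed so the sketch is repeatable
    print(drop_prompts(["a cat", "a dog", "a tree"], PROMPT_DROP_PROB, rng))
```

Dropping prompts independently per sample means the empty-string fraction is 20% only in expectation; a fixed-count scheme per batch would be an equally reasonable reading of the description.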