Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Aligning Text-to-Image Diffusion Models to Human Preference by Classification

Authors: Longquan Dai, Xiaolu Wei, He Wang, Shaomeng Wang, Jinhui Tang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments on various diffusion models show that our ABC consistently outperforms existing baselines, offering a scalable and robust solution for preference-based text-to-image fine-tuning.
Researcher Affiliation	Academia	Longquan Dai, Xiaolu Wei, He Wang, Shaomeng Wang, and Jinhui Tang; Nanjing University of Science and Technology, Nanjing, China; EMAIL
Pseudocode	No	The paper includes mathematical formulations, theorems, and equations but does not present any clearly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code	Yes	Code is available at https://github.com/dailongquan/abc.
Open Datasets	Yes	Diffusion-DPO [51] is fine-tuned on Pick-a-Pic [23], a human preference dataset for text-to-image generation. We evaluate the alignment performance of diffusion models using prompts from HPS and PartiPrompts across various evaluators. To validate the effectiveness of Theorem 1, we evaluate zero-shot classification performance on six benchmark datasets: Food-101 [3], CIFAR-10 [24], Aircraft [32], Pets [36], Flowers102 [33], and STL-10 [10].
Dataset Splits	No	The paper mentions fine-tuning on the Pick-a-Pic dataset and evaluating on various benchmarks like HPS and PartiPrompts, but it does not specify the train/validation/test splits used for the ABC model training itself. For user studies, it mentions "randomly sampled 100 prompts from the PartiPrompts dataset and another 100 prompts from the HPSv2 benchmark", but this is for evaluation, not model training splits.
Hardware Specification	Yes	We train the models using the AdamW [30] optimizer for SD1.5, Adafactor [45] optimizer for SDXL on 8 A6000 GPUs
Software Dependencies	No	The paper mentions using "AdamW [30] optimizer" and "Adafactor [45] optimizer" but does not specify software versions for programming languages, libraries, or other key components.
Experiment Setup	Yes	We train the models using the AdamW [30] optimizer for SD1.5, Adafactor [45] optimizer for SDXL on 8 A6000 GPUs, with a batch size of 2, gradient accumulation of 128 steps and a learning rate of 1 × 10^{-8}, incorporating a linear warmup schedule. For SD1.5 and SDXL training, δ is set to 0.025.
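
The arithmetic implied by the experiment setup row can be sketched as follows. This is a minimal illustration, not the authors' training code. It assumes the quoted batch size of 2 is per GPU, reads the learning rate as 1e-8, and uses a hypothetical warmup length (WARMUP_STEPS), since the quoted text does not state one. The function names are likewise illustrative.

```python
def effective_batch_size(per_device_batch: int,
                         accumulation_steps: int,
                         num_devices: int) -> int:
    """Samples contributing to one optimizer update.

    Assumes the reported batch size is per device (not stated in the source).
    """
    return per_device_batch * accumulation_steps * num_devices


def linear_warmup_lr(step: int, base_lr: float, warmup_steps: int) -> float:
    """Learning rate at a given 0-indexed optimizer step under linear warmup.

    The rate rises linearly to base_lr over warmup_steps, then stays constant.
    """
    if warmup_steps <= 0:
        return base_lr
    return base_lr * min(1.0, (step + 1) / warmup_steps)


# Values quoted in the table row above.
PER_DEVICE_BATCH = 2
ACCUMULATION_STEPS = 128
NUM_GPUS = 8
BASE_LR = 1e-8

# Hypothetical: the warmup length is not given in the quoted text.
WARMUP_STEPS = 500

if __name__ == "__main__":
    print("effective batch size:",
          effective_batch_size(PER_DEVICE_BATCH, ACCUMULATION_STEPS, NUM_GPUS))
    for step in (0, 249, 499, 1000):
        print(step, linear_warmup_lr(step, BASE_LR, WARMUP_STEPS))
```

Under the per-GPU assumption, each optimizer update would aggregate 2 × 128 × 8 = 2048 samples. If the batch size of 2 is instead a global figure, the product drops to 256.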