Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

SPRO: Improving Image Generation via Self-Play

Authors: Ritika Jha, Aanisha Bhattacharyya, Yaman Singla, Rajiv Ratn Shah, Changyou Chen, Balaji Krishnamurthy

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We conducted experiments to evaluate the effectiveness of our self-play framework, SPRO, to optimize images for various human preference objectives. Data-generation and training details are provided in Appendix A.4. Reward Metrics: We focus on three human preference objectives: aesthetic appeal, user preference, and engagement, each representing a distinct dimension of evaluation. These objectives are quantified using learned reward models. For user preference, we use PickScore [11], a CLIP-based model trained on large-scale preference data. For engagement, we adopt EngageNet [9], a foundation model trained on Twitter data to predict the social media engagement an image is likely to receive. For aesthetic appeal, we employ the LAION aesthetic scorer [22], trained on human-annotated aesthetic ratings.
Researcher Affiliation	Collaboration	Adobe Media and Data Science Research, SUNY at Buffalo
Pseudocode	No	The paper describes methods through text and figures (e.g., Figure 2: "Our SPRO framework consists of three stages") but does not include any explicitly labeled pseudocode or algorithm blocks with structured steps in a code-like format.
Open Source Code	Yes	The code with all details will be added in the supplementary material.
Open Datasets	Yes	We release a synthetic dataset of over one million image-prompt pairs aligned with human preferences, generated entirely through self-play using SPRO. This dataset provides a scalable, annotation-free resource for future research in preference alignment. The dataset is available here. In this work, we use three available datasets: Flickr30k [30], Pick-a-Pic [11], and EngagingImageNet [9].
Dataset Splits	Yes	For aesthetic appeal, we use a held-out set of 514 images from the Flickr30k dataset. We evaluate on the PartiPrompts [31] and Pick-a-Pic test split. (1) Aesthetic appeal: 30,000 image-caption pairs from the Flickr30k dataset [30]. (2) Engagement: 35,000 images sampled from the EngagingImageNet train set [9], with equal representation from the top and bottom 20th percentiles of like counts to capture both highly and poorly engaging content. (3) User preference: 17,400 images from the validation split of the Pick-a-Pic dataset [11].
Hardware Specification	Yes	All experiments are conducted on 8 NVIDIA A100 GPUs.
Software Dependencies	No	The paper mentions models like "LLaMA-3.2-11B-Vision-Instruct [14]" and "frozen Stable Diffusion XL (SDXL) [17]" but does not specify software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x, CUDA 11.x).
Experiment Setup	Yes	DPO fine-tuning is performed on same with a per-device batch size of 2, using gradient accumulation to achieve a larger effective batch size. Training is run for 2 epochs using bf16 precision and updates all model parameters. For image generation, we use a frozen Stable Diffusion XL (SDXL) [17] as the base diffusion model. In the SPRO-Image method, we finetune the same SDXL model using high-reward synthetic images paired with their corresponding base captions. We use a resolution of 512 × 512, a learning rate of 1e-6, the 8-bit Adam optimizer, and train for 50 epochs with a gradient accumulation factor of 4.
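
The Experiment Setup row mentions an "effective batch size" without stating it. The sketch below shows how that figure is usually derived as per-device batch size × number of devices × gradient accumulation steps. It is an illustration under assumptions, not the authors' configuration: the class name `TrainConfig` is hypothetical, the use of all 8 GPUs in data-parallel training is assumed, and the accumulation factor for the DPO stage is not reported, so the value 4 used here is only a placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the batch-related settings of one training stage."""
    per_device_batch_size: int
    num_devices: int
    grad_accum_steps: int

    @property
    def effective_batch_size(self) -> int:
        # Samples contributing to one optimizer step under data parallelism.
        return self.per_device_batch_size * self.num_devices * self.grad_accum_steps


# DPO stage: per-device batch size 2 and 8 A100 GPUs are reported.
# The accumulation factor is NOT reported; 4 is an assumed placeholder.
dpo = TrainConfig(per_device_batch_size=2, num_devices=8, grad_accum_steps=4)
print(dpo.effective_batch_size)  # 2 * 8 * 4 = 64
```

With a different accumulation factor or device count, the same product applies; the reported values alone do not determine the effective batch size.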