Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Fast Data Attribution for Text-to-Image Models

Authors: Sheng-Yu Wang, Aaron Hertzmann, Alexei A Efros, Richard Zhang, Jun-Yan Zhu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We show extensive results on both medium-scale models trained on MSCOCO and large-scale Stable Diffusion models trained on LAION, demonstrating that our method can achieve better or competitive performance in a few seconds, faster than existing methods by 2,500× to 400,000×.
Researcher Affiliation	Collaboration	(1) Carnegie Mellon University, (2) Adobe Research, (3) UC Berkeley
Pseudocode	No	The paper describes methods and formulations (e.g., Section 3.2, Appendix A.1, A.2, A.3) but does not present them in a structured pseudocode or algorithm block format.
Open Source Code	Yes	Our code, models, and datasets are at: https://peterwang512.github.io/FastGDA.
Open Datasets	Yes	MSCOCO dataset: Creative Commons Attribution 4.0 License. LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs. DiffusionDB images: MIT License.
Dataset Splits	Yes	To build our dataset, for each query, we select the top 10k nearest neighbor candidates... We take 4900 queries for training and 100 for validation. We collect 5000 queries for training and 50 queries for validation, for a total of 101M query-training attribution ranks.
Hardware Specification	Yes	We run on a single Nvidia A100 80GB for benchmarking. Our experiments are all done by NVIDIA A100 GPUs.
Software Dependencies	No	The paper mentions optimizers like AdamW and specifies parameters like learning rate, but does not provide specific version numbers for software libraries or environments (e.g., Python 3.x, PyTorch 1.x, CUDA x.x).
Experiment Setup	Yes	Our rank model is a 3-layer MLP with hidden and output dimensions of 768. We optimize using AdamW (learning rate 10^-3, default betas 0.9, 0.999, weight decay 0.01) for 10 epochs on the training set, without any additional learning-rate scheduling.
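
To make the optimizer settings quoted in the Experiment Setup row concrete, the following is a minimal sketch of a single AdamW update on one scalar parameter, using the reported learning rate, betas, and weight decay. It is not the authors' training code: the `AdamWConfig` and `adamw_step` names, the `eps` value, the starting parameter, and the gradient are all assumptions made for illustration, and the update rule follows the commonly used decoupled weight decay formulation.

```python
import math
from dataclasses import dataclass


@dataclass
class AdamWConfig:
    # Values quoted in the Experiment Setup row.
    lr: float = 1e-3
    beta1: float = 0.9
    beta2: float = 0.999
    weight_decay: float = 0.01
    # Assumption: eps is not stated in the source; 1e-8 is a common default.
    eps: float = 1e-8


def adamw_step(param, grad, m, v, t, cfg):
    """Apply one AdamW update to a scalar parameter (illustrative only)."""
    # Decoupled weight decay: shrink the parameter directly, independent of the gradient.
    param = param * (1.0 - cfg.lr * cfg.weight_decay)
    # Exponential moving averages of the gradient and the squared gradient.
    m = cfg.beta1 * m + (1.0 - cfg.beta1) * grad
    v = cfg.beta2 * v + (1.0 - cfg.beta2) * grad * grad
    # Bias correction for the zero-initialized moment estimates (t starts at 1).
    m_hat = m / (1.0 - cfg.beta1 ** t)
    v_hat = v / (1.0 - cfg.beta2 ** t)
    param = param - cfg.lr * m_hat / (math.sqrt(v_hat) + cfg.eps)
    return param, m, v


cfg = AdamWConfig()
# Hypothetical starting parameter and gradient, chosen only for the example.
p, m, v = 1.0, 0.0, 0.0
p, m, v = adamw_step(p, grad=0.5, m=m, v=v, t=1, cfg=cfg)
print(p)
```

Because the row states that no learning-rate scheduling is used, `cfg.lr` would stay fixed at this value across all 10 epochs in a sketch like this one.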