Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Fast Data Attribution for Text-to-Image Models
Authors: Sheng-Yu Wang, Aaron Hertzmann, Alexei A Efros, Richard Zhang, Jun-Yan Zhu
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show extensive results on both medium-scale models trained on MSCOCO and large-scale Stable Diffusion models trained on LAION, demonstrating that our method can achieve better or competitive performance in a few seconds, faster than existing methods by 2,500×–400,000×. |
| Researcher Affiliation | Collaboration | 1Carnegie Mellon University 2Adobe Research 3UC Berkeley |
| Pseudocode | No | The paper describes methods and formulations (e.g., Section 3.2, Appendix A.1, A.2, A.3) but does not present them in a structured pseudocode or algorithm block format. |
| Open Source Code | Yes | Our code, models, and datasets are at: https://peterwang512.github.io/FastGDA. |
| Open Datasets | Yes | MSCOCO dataset: Creative Commons Attribution 4.0 License. LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs. DiffusionDB images: MIT License. |
| Dataset Splits | Yes | To build our dataset, for each query, we select the top 10k nearest neighbor candidates... We take 4900 queries for training and 100 for validation. We collect 5000 queries for training and 50 queries for validation, for a total of 101M query-training attribution ranks. |
| Hardware Specification | Yes | We run on a single Nvidia A100 80GB for benchmarking. Our experiments are all done by NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions optimizers like AdamW and specifies parameters like learning rate, but does not provide specific version numbers for software libraries or environments (e.g., Python 3.x, PyTorch 1.x, CUDA x.x). |
| Experiment Setup | Yes | Our rank model is a 3-layer MLP with hidden and output dimensions of 768. We optimize using AdamW (learning rate 10⁻³, default betas 0.9, 0.999, weight decay 0.01) for 10 epochs on the training set, without any additional learning-rate scheduling. |
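
The experiment setup quoted in the last row can be sketched in PyTorch as follows. This is a minimal illustration, not the authors' code: the input dimension, the ReLU activations, and the helper names `build_rank_model` and `build_optimizer` are assumptions that do not appear in the source, and the learning rate is read as 1e-3. The training loss is not described in this summary, so no training loop is shown.

```python
import torch
from torch import nn

INPUT_DIM = 768   # assumption: the input feature size is not stated in the source
HIDDEN_DIM = 768  # from the source: hidden dimension of 768
OUTPUT_DIM = 768  # from the source: output dimension of 768
NUM_EPOCHS = 10   # from the source: 10 epochs, no learning-rate scheduling


def build_rank_model(input_dim: int = INPUT_DIM) -> nn.Sequential:
    """Hypothetical helper: a 3-layer MLP (ReLU activations are an assumption)."""
    return nn.Sequential(
        nn.Linear(input_dim, HIDDEN_DIM),
        nn.ReLU(),
        nn.Linear(HIDDEN_DIM, HIDDEN_DIM),
        nn.ReLU(),
        nn.Linear(HIDDEN_DIM, OUTPUT_DIM),
    )


def build_optimizer(model: nn.Module) -> torch.optim.AdamW:
    """Hypothetical helper: AdamW with the hyperparameters quoted in the table."""
    return torch.optim.AdamW(
        model.parameters(),
        lr=1e-3,             # learning rate, read as 10^-3
        betas=(0.9, 0.999),  # default betas
        weight_decay=0.01,
    )


model = build_rank_model()
optimizer = build_optimizer(model)

# Forward pass on a dummy batch to confirm the output shape.
dummy_batch = torch.zeros(4, INPUT_DIM)
embeddings = model(dummy_batch)
print(embeddings.shape)
```

Because no scheduler is attached, the learning rate stays constant across all `NUM_EPOCHS` epochs, matching the "without any additional learning-rate scheduling" statement.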