Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Intriguing Properties of Data Attribution on Diffusion Models
Authors: Xiaosen Zheng, Tianyu Pang, Chao Du, Jing Jiang, Min Lin
ICLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we conduct extensive experiments and ablation studies on attributing diffusion models, specifically focusing on DDPMs trained on CIFAR-10 and CelebA, as well as a Stable Diffusion model LoRA-finetuned on ArtBench. Intriguingly, we report counter-intuitive observations that theoretically unjustified design choices for attribution empirically outperform previous baselines by a large margin, in terms of both linear datamodeling score and counterfactual evaluation. |
| Researcher Affiliation | Collaboration | (1) Singapore Management University; (2) Sea AI Lab, Singapore |
| Pseudocode | No | The paper does not contain pseudocode or a clearly labeled algorithm block. |
| Open Source Code | Yes | The code is available at https://github.com/sail-sg/D-TRAK. |
| Open Datasets | Yes | Our experiments are conducted on three datasets including CIFAR (32 × 32), CelebA (64 × 64), and ArtBench (256 × 256). More details of datasets can be found in Appendix A.1. CIFAR (32 × 32). The CIFAR-10 dataset (Krizhevsky et al., 2009) contains 50,000 training samples. CelebA (64 × 64). We sample a subset of 5,000 training samples and 1,000 validation samples from the original training set and test set of CelebA (Liu et al., 2015). ArtBench (256 × 256). ArtBench (Liao et al., 2022) is a dataset for artwork generation. |
| Dataset Splits | Yes | We randomly sample 1,000 validation samples from CIFAR-10's test set for LDS evaluation. To reduce computation, we also construct a CIFAR-2 dataset as a subset of CIFAR-10, which consists of 5,000 training samples randomly sampled from CIFAR-10's training samples corresponding to the automobile and horse classes, and 1,000 validation samples randomly sampled from CIFAR-10's test set corresponding to the same two classes. |
| Hardware Specification | Yes | For all of our experiments, we use 64 CPU cores and NVIDIA A100 GPUs each with 40GB of memory. |
| Software Dependencies | Yes | In this paper, we train various diffusion models for different datasets using the Diffusers library. We compute the per-sample gradient following a tutorial of the PyTorch library (version 2.0.1). We use the trak library to project gradients with a random projection matrix, which is implemented using a faster custom CUDA kernel. |
| Experiment Setup | Yes | The maximum timestep is T = 1000, and we choose the linear variance schedule for the forward diffusion process as β₁ = 10⁻⁴ to β_T = 0.02. We set the dropout rate to 0.1, employ the AdamW (Loshchilov & Hutter, 2019) optimizer with weight decay of 10⁻⁶, and augment the data with random horizontal flips. A DDPM is trained for 200 epochs with a 128 batch size, using a cosine annealing learning rate schedule with a 0.1 fraction warmup and an initial learning rate of 10⁻⁴. |
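
The Experiment Setup row can be read as two schedules: a linear variance schedule over T = 1000 timesteps and a warmup-plus-cosine learning rate schedule. The following is a minimal pure-Python sketch of both, not the authors' code (which is in the linked repository). The function names are hypothetical, the linear shape of the warmup is an assumption (the excerpt gives only the 0.1 warmup fraction), and the step count in the usage note is illustrative only.

```python
import math

# Values quoted in the Experiment Setup row.
T = 1000
BETA_START, BETA_END = 1e-4, 0.02
BASE_LR, WARMUP_FRAC = 1e-4, 0.1


def linear_betas(num_steps=T, beta_start=BETA_START, beta_end=BETA_END):
    """Linearly spaced variances beta_1 .. beta_T for the forward process."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def alpha_bars(betas):
    """Cumulative products of (1 - beta_t), the usual DDPM alpha-bar terms."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def lr_at(step, total_steps, base_lr=BASE_LR, warmup_frac=WARMUP_FRAC):
    """Learning rate at a given optimizer step.

    Assumption: warmup rises linearly from 0 to base_lr, then the rate
    follows half a cosine period down to 0 over the remaining steps.
    """
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

As a usage illustration, if one assumed training on all 50,000 CIFAR-10 training samples, 200 epochs at batch size 128 would give `total_steps = 200 * math.ceil(50_000 / 128)`, and `lr_at(step, total_steps)` could be evaluated per step. The excerpt does not say which dataset the 200-epoch setting applies to, so that count is a placeholder.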