Intriguing Properties of Data Attribution on Diffusion Models
Authors: Xiaosen Zheng, Tianyu Pang, Chao Du, Jing Jiang, Min Lin
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we conduct extensive experiments and ablation studies on attributing diffusion models, specifically focusing on DDPMs trained on CIFAR-10 and CelebA, as well as a Stable Diffusion model LoRA-finetuned on ArtBench. Intriguingly, we report counter-intuitive observations that theoretically unjustified design choices for attribution empirically outperform previous baselines by a large margin, in terms of both linear datamodeling score and counterfactual evaluation. (An illustrative LDS sketch follows this table.) |
| Researcher Affiliation | Collaboration | ¹Singapore Management University, ²Sea AI Lab, Singapore. {zhengxs, tianyupang, duchao, linmin}@sea.com; jingjiang@smu.edu.sg |
| Pseudocode | No | The paper does not contain pseudocode or a clearly labeled algorithm block. |
| Open Source Code | Yes | The code is available at https://github.com/sail-sg/D-TRAK. |
| Open Datasets | Yes | Our experiments are conducted on three datasets including CIFAR (32 × 32), CelebA (64 × 64), and ArtBench (256 × 256). More details of datasets can be found in Appendix A.1. CIFAR (32 × 32): The CIFAR-10 dataset (Krizhevsky et al., 2009) contains 50,000 training samples. CelebA (64 × 64): We sample a subset of 5,000 training samples and 1,000 validation samples from the original training set and test set of CelebA (Liu et al., 2015). ArtBench (256 × 256): ArtBench (Liao et al., 2022) is a dataset for artwork generation. |
| Dataset Splits | Yes | We randomly sample 1,000 validation samples from CIFAR-10’s test set for LDS evaluation. To reduce computation, we also construct a CIFAR-2 dataset as a subset of CIFAR-10, which consists of 5,000 training samples randomly sampled from CIFAR-10’s training samples corresponding to the automobile and horse classes, and 1,000 validation samples randomly sampled from CIFAR-10’s test set corresponding to the same two classes. (A subset-construction sketch follows this table.) |
| Hardware Specification | Yes | For all of our experiments, we use 64 CPU cores and NVIDIA A100 GPUs each with 40GB of memory. |
| Software Dependencies | Yes | In this paper, we train various diffusion models for different datasets using the Diffusers library. We compute the per-sample gradient following a tutorial of the PyTorch library (version 2.0.1). We use the trak library to project gradients with a random projection matrix, which is implemented using a faster custom CUDA kernel. (A per-sample-gradient sketch follows this table.) |
| Experiment Setup | Yes | The maximum timestep is T = 1000, and we choose the linear variance schedule for the forward diffusion process, from β_1 = 10^−4 to β_T = 0.02. We set the dropout rate to 0.1, employ the AdamW (Loshchilov & Hutter, 2019) optimizer with a weight decay of 10^−6, and augment the data with random horizontal flips. A DDPM is trained for 200 epochs with a batch size of 128, using a cosine annealing learning rate schedule with a warmup fraction of 0.1 and an initial learning rate of 10^−4. (A configuration sketch follows this table.) |
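The linear datamodeling score (LDS) cited in the Research Type row measures, for each validation sample, the Spearman rank correlation between attribution-predicted model outputs and the outputs of models actually retrained on random training subsets (Park et al., 2023). Below is a minimal illustrative sketch of that computation, not the paper's evaluation code; all array names and the toy data are placeholders.

```python
import numpy as np
from scipy.stats import spearmanr

def lds(attributions, subset_masks, retrained_outputs):
    """Average Spearman correlation between predicted and measured outputs.

    attributions:      (n_train, n_val) attribution scores
    subset_masks:      (n_subsets, n_train) 0/1 membership of each random subset
    retrained_outputs: (n_subsets, n_val) outputs of models retrained per subset
    """
    predicted = subset_masks @ attributions  # additive (linear) prediction
    corrs = [spearmanr(predicted[:, j], retrained_outputs[:, j]).correlation
             for j in range(attributions.shape[1])]
    return float(np.mean(corrs))

# Toy demo with random placeholder data.
rng = np.random.default_rng(0)
score = lds(rng.normal(size=(100, 5)),
            rng.integers(0, 2, size=(64, 100)),
            rng.normal(size=(64, 5)))
```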
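The CIFAR-2 construction in the Dataset Splits row (automobile and horse, labels 1 and 7 in CIFAR-10) can be sketched as below; the seed and the exact sampling procedure are assumptions here, not taken from the released code.

```python
import numpy as np
from torchvision.datasets import CIFAR10

def subset_indices(labels, classes, n, rng):
    # Indices of all samples whose label is in `classes`, subsampled to n.
    idx = np.flatnonzero(np.isin(np.asarray(labels), classes))
    return rng.choice(idx, size=n, replace=False)

rng = np.random.default_rng(42)  # hypothetical seed
train = CIFAR10(root="./data", train=True, download=True)
test = CIFAR10(root="./data", train=False, download=True)

train_idx = subset_indices(train.targets, [1, 7], 5000, rng)  # CIFAR-2 train
val_idx = subset_indices(test.targets, [1, 7], 1000, rng)     # CIFAR-2 validation
```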
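The per-sample gradient and random projection pipeline in the Software Dependencies row can be sketched with torch.func (per the PyTorch per-sample-gradients tutorial) and trak's projectors. This is a CPU-runnable toy with a stand-in network, using BasicProjector in place of the CUDA-kernel CudaProjector the paper's setup points to; it assumes the trak 0.x projector API.

```python
import torch
from torch.func import functional_call, vmap, grad
from trak.projectors import BasicProjector, ProjectionType

torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(),
                            torch.nn.Linear(64, 32))  # stand-in for the U-Net
params = {k: v.detach() for k, v in model.named_parameters()}
buffers = {k: v.detach() for k, v in model.named_buffers()}

def loss_fn(params, buffers, x_t, noise):
    # Squared-error "noise prediction" loss for a single sample.
    pred = functional_call(model, (params, buffers), (x_t.unsqueeze(0),))
    return torch.nn.functional.mse_loss(pred, noise.unsqueeze(0))

x_t, noise = torch.randn(8, 32), torch.randn(8, 32)
# vmap over the batch dimension yields one gradient dict per sample.
grads = vmap(grad(loss_fn), in_dims=(None, None, 0, 0))(params, buffers, x_t, noise)

# Flatten each sample's gradients, then randomly project to a low dimension.
flat = torch.cat([g.reshape(g.shape[0], -1) for g in grads.values()], dim=1)
projector = BasicProjector(grad_dim=flat.shape[1], proj_dim=512, seed=0,
                           proj_type=ProjectionType.normal, device="cpu")
projected = projector.project(flat, model_id=0)  # shape (8, 512)
```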
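The Experiment Setup row maps directly onto Diffusers components; the sketch below wires the stated values together. The U-Net architecture hyperparameters (including where the 0.1 dropout is set) are placeholders, not the paper's exact config.

```python
import torch
from diffusers import DDPMScheduler, UNet2DModel
from diffusers.optimization import get_cosine_schedule_with_warmup
from torchvision import transforms

# Linear variance schedule: T = 1000, beta_1 = 1e-4 to beta_T = 0.02.
noise_scheduler = DDPMScheduler(num_train_timesteps=1000,
                                beta_schedule="linear",
                                beta_start=1e-4, beta_end=0.02)
# Generic U-Net; the 0.1 dropout lives in the real model config (omitted here).
model = UNet2DModel(sample_size=32, in_channels=3, out_channels=3)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-6)

num_train, batch_size, epochs = 5000, 128, 200   # e.g. the CIFAR-2 subset
total_steps = epochs * (num_train // batch_size)
lr_scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),     # 0.1 warmup fraction
    num_training_steps=total_steps)

# Data augmentation: random horizontal flips.
augment = transforms.Compose([transforms.RandomHorizontalFlip(),
                              transforms.ToTensor()])
```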