Intriguing Properties of Data Attribution on Diffusion Models

Authors: Xiaosen Zheng, Tianyu Pang, Chao Du, Jing Jiang, Min Lin

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we conduct extensive experiments and ablation studies on attributing diffusion models, specifically focusing on DDPMs trained on CIFAR-10 and CelebA, as well as a Stable Diffusion model LoRA-finetuned on ArtBench. Intriguingly, we report counter-intuitive observations that theoretically unjustified design choices for attribution empirically outperform previous baselines by a large margin, in terms of both linear datamodeling score and counterfactual evaluation.
Researcher Affiliation | Collaboration | ¹Singapore Management University, ²Sea AI Lab, Singapore; {zhengxs, tianyupang, duchao, linmin}@sea.com; jingjiang@smu.edu.sg
Pseudocode | No | The paper does not contain pseudocode or a clearly labeled algorithm block.
Open Source Code | Yes | The code is available at https://github.com/sail-sg/D-TRAK.
Open Datasets | Yes | Our experiments are conducted on three datasets including CIFAR (32 × 32), CelebA (64 × 64), and ArtBench (256 × 256). More details of datasets can be found in Appendix A.1. CIFAR (32 × 32): The CIFAR-10 dataset (Krizhevsky et al., 2009) contains 50,000 training samples. CelebA (64 × 64): We sample a subset of 5,000 training samples and 1,000 validation samples from the original training set and test set of CelebA (Liu et al., 2015). ArtBench (256 × 256): ArtBench (Liao et al., 2022) is a dataset for artwork generation.
Dataset Splits | Yes | We randomly sample 1,000 validation samples from CIFAR-10’s test set for LDS evaluation. To reduce computation, we also construct a CIFAR-2 dataset as a subset of CIFAR-10, which consists of 5,000 training samples randomly sampled from CIFAR-10’s training samples corresponding to the automobile and horse classes, and 1,000 validation samples randomly sampled from CIFAR-10’s test set corresponding to the same two classes.
Hardware Specification | Yes | For all of our experiments, we use 64 CPU cores and NVIDIA A100 GPUs each with 40GB of memory.
Software Dependencies | Yes | In this paper, we train various diffusion models for different datasets using the Diffusers library. We compute the per-sample gradient following a tutorial of the PyTorch library (version 2.0.1). We use the trak library to project gradients with a random projection matrix, which is implemented using a faster custom CUDA kernel.
Experiment Setup | Yes | The maximum timestep is T = 1000, and we choose the linear variance schedule for the forward diffusion process, from β_1 = 10^-4 to β_T = 0.02. We set the dropout rate to 0.1, employ the AdamW (Loshchilov & Hutter, 2019) optimizer with a weight decay of 10^-6, and augment the data with random horizontal flips. A DDPM is trained for 200 epochs with a batch size of 128, using a cosine annealing learning rate schedule with a 0.1 fraction warmup and an initial learning rate of 10^-4.
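
The Research Type row above cites the linear datamodeling score (LDS) as an evaluation metric. For reference, the block below sketches the standard LDS definition from the datamodels/TRAK line of work that this evaluation follows; the notation (attribution scores τ, random subsets S_j, measurement function f) is ours, not quoted from the paper.

```latex
% Sketch of the linear datamodeling score (LDS); notation is illustrative.
% \tau(z)_i        : attribution score assigned to training example i for a target example z
% S_1, \dots, S_M  : random subsets of the training set
% \theta^*(S_j)    : a model retrained from scratch on subset S_j
% f(z; \theta)     : measurement function (e.g., the diffusion loss evaluated at z)
\begin{align}
  g_\tau(z, S_j) &= \sum_{i \in S_j} \tau(z)_i
    && \text{additive prediction of the model output} \\
  \mathrm{LDS}(\tau, z) &= \rho\big(\{ f(z; \theta^*(S_j)) \}_{j=1}^{M},\,
                                    \{ g_\tau(z, S_j) \}_{j=1}^{M}\big)
    && \text{Spearman rank correlation } \rho
\end{align}
```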
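
The Dataset Splits row describes building a CIFAR-2 subset from the automobile and horse classes. Below is a minimal sketch of one way to construct such a subset with torchvision; the random seed and the use of torchvision's CIFAR10 loader are assumptions for illustration, not the authors' pipeline.

```python
# Illustrative sketch (not the authors' code): build a CIFAR-2 subset with 5,000
# training images and 1,000 validation images from the "automobile" and "horse"
# classes of CIFAR-10.
import numpy as np
from torch.utils.data import Subset
from torchvision.datasets import CIFAR10

rng = np.random.default_rng(seed=0)  # assumed seed; the paper does not state one

train_set = CIFAR10(root="./data", train=True, download=True)
test_set = CIFAR10(root="./data", train=False, download=True)
keep = [train_set.class_to_idx["automobile"], train_set.class_to_idx["horse"]]

def subsample(dataset, num_samples):
    """Randomly choose `num_samples` indices whose label is one of the kept classes."""
    labels = np.asarray(dataset.targets)
    candidates = np.where(np.isin(labels, keep))[0]
    return rng.choice(candidates, size=num_samples, replace=False).tolist()

cifar2_train = Subset(train_set, subsample(train_set, 5_000))   # 5,000 training samples
cifar2_val = Subset(test_set, subsample(test_set, 1_000))       # 1,000 validation samples
```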
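
The Software Dependencies row mentions computing per-sample gradients following a PyTorch tutorial. The sketch below applies that torch.func pattern (functional_call + grad + vmap) to the DDPM noise-prediction loss; the toy UNet2DModel configuration and random tensors are illustrative assumptions, not the paper's actual model or data.

```python
# Minimal sketch, assuming a small diffusers UNet2DModel and the standard DDPM
# noise-prediction (MSE) loss: per-sample gradients via torch.func.
import torch
import torch.nn.functional as F
from diffusers import UNet2DModel
from torch.func import functional_call, grad, vmap

model = UNet2DModel(                       # toy configuration for illustration
    sample_size=32, in_channels=3, out_channels=3, layers_per_block=1,
    block_out_channels=(32, 64),
    down_block_types=("DownBlock2D", "DownBlock2D"),
    up_block_types=("UpBlock2D", "UpBlock2D"),
)
params = {k: v.detach() for k, v in model.named_parameters()}
buffers = {k: v.detach() for k, v in model.named_buffers()}

def sample_loss(params, buffers, x_t, t, noise):
    # x_t: one noisy image (C, H, W); t: its scalar timestep; noise: the true noise.
    pred = functional_call(model, (params, buffers), (x_t.unsqueeze(0), t)).sample
    return F.mse_loss(pred, noise.unsqueeze(0))

# Differentiate w.r.t. the parameters and vmap over the batch dimension of
# (x_t, t, noise); params and buffers are shared across samples.
per_sample_grad_fn = vmap(grad(sample_loss), in_dims=(None, None, 0, 0, 0))

x_t = torch.randn(8, 3, 32, 32)            # toy batch of noisy images
t = torch.randint(0, 1000, (8,))           # one timestep per sample
noise = torch.randn_like(x_t)
per_sample_grads = per_sample_grad_fn(params, buffers, x_t, t, noise)
# Each entry gains a leading batch dimension; flattened, these are the gradients
# that would then be randomly projected (e.g., with the trak library).
```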
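
The Experiment Setup row lists the DDPM training hyperparameters. The sketch below wires those values into a Diffusers-style setup as one plausible reconstruction; the model architecture, dataset size, and helper choices are assumptions, not the authors' training script.

```python
# Illustrative reconstruction of the stated training configuration, assuming the
# Diffusers library; values mirror the hyperparameters quoted in the row above.
import torch
from diffusers import DDPMScheduler, UNet2DModel
from diffusers.optimization import get_cosine_schedule_with_warmup
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),              # random horizontal flips
    transforms.ToTensor(),
    transforms.Normalize([0.5] * 3, [0.5] * 3),     # scale pixels to [-1, 1]
])

noise_scheduler = DDPMScheduler(                    # T = 1000, linear beta schedule
    num_train_timesteps=1000,
    beta_schedule="linear", beta_start=1e-4, beta_end=0.02,
)
model = UNet2DModel(
    sample_size=32, in_channels=3, out_channels=3,
    dropout=0.1,                                    # dropout kwarg in recent diffusers releases
)

epochs, batch_size = 200, 128
steps_per_epoch = 5_000 // batch_size               # assumes the 5,000-sample CIFAR-2 split
total_steps = epochs * steps_per_epoch
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-6)
lr_scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),        # 0.1 fraction warmup
    num_training_steps=total_steps,                 # cosine annealing over training
)
```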