Patch Diffusion: Faster and More Data-Efficient Training of Diffusion Models

Authors: Zhendong Wang, Yifan Jiang, Huangjie Zheng, Peihao Wang, Pengcheng He, Zhangyang "Atlas" Wang, Weizhu Chen, Mingyuan Zhou

NeurIPS 2023

Reproducibility assessment. Each entry gives the variable, its result, and the LLM response supporting that result.
Research Type: Experimental
LLM Response: "We conduct five sets of experiments to validate our patch diffusion training method. In the first subsection, we conduct an ablation study on what impacts the performance of our method. In the second subsection, we compare our method with its backbone model and other state-of-the-art diffusion model baselines on commonly-used benchmark datasets. Thirdly, we show that our method could also help improve the efficiency of finetuning large-scale pretrained models. Then, we show that patch diffusion models could achieve better generation quality on typical small datasets. Finally, we evaluate the out-painting capability of patch diffusion models."

Researcher Affiliation: Collaboration
LLM Response: Zhendong Wang (1,2), Yifan Jiang (1), Huangjie Zheng (1,2), Peihao Wang (1), Pengcheng He (2), Zhangyang Wang (1), Weizhu Chen (2), and Mingyuan Zhou (1). Affiliations: (1) The University of Texas at Austin, (2) Microsoft Azure AI.

Pseudocode: No
LLM Response: The paper describes its methods in prose and equations but does not include any structured pseudocode or algorithm blocks.

Open Source Code: Yes
LLM Response: "We share our code and pre-trained models at https://github.com/Zhendong-Wang/Patch-Diffusion."

Open Datasets: Yes
LLM Response: "Datasets. Following previous works [22, 8, 23], we select CelebA (~200k images) [34], FFHQ (70k images) [20], LSUN (~200k images) [63], and ImageNet (~1.2 million images) [9] as our large datasets, and AFHQv2-Cat/Dog/Wild (~5k images in each of them) [7] as our small datasets."

Dataset Splits: No
LLM Response: The paper mentions the "full training set used as reference" for FID calculation and a training "duration of 200 million images," but it does not explicitly state the train/validation/test splits (e.g., percentages or sample counts) used for the experiments; it relies on the implied standard splits of the benchmark datasets without declaring them.

Hardware Specification: Yes
LLM Response: "We train all models on 16 Nvidia V100 GPUs with a batch size of 512 for a duration of 200 million images." (A quick step-count check on these numbers follows below.)

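The quoted budget pins down the optimizer step count by simple arithmetic; the snippet below just evaluates it (the variable names are ours, not the paper's):

```python
# Step-count arithmetic from the quoted setup: 200M images at batch size 512.
images_seen = 200_000_000
batch_size = 512
print(images_seen // batch_size)  # 390625 optimizer steps
```
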
Software Dependencies: No
LLM Response: The paper states, "We implement our Patch Diffusion on top of the current state-of-the-art UNet-based diffusion models EDM-DDPM++ [22] and EDM-ADM [22]," but it does not specify software versions for programming languages, libraries, or frameworks (e.g., Python, PyTorch, or CUDA versions).

Experiment Setup: Yes
LLM Response: "We inherit the hyperparameter settings and the UNet architecture from Karras et al. [22]. We implement random cropping independently on each data point in the same batch, and the corresponding pixel coordinates are concatenated with the image channels to form the input for the UNet denoiser. When computing the loss defined in Equation (4), we ignore the reconstructed coordinate channels and only minimize the loss on the image channels. We train all models on 16 Nvidia V100 GPUs with a batch size of 512 for a duration of 200 million images. During training, we randomly drop 10% of the given class labels, and during sampling we apply classifier-free guidance with strength cfg = 1.3."
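
To make the quoted setup concrete, here is a minimal PyTorch sketch of the described mechanism: an independent random crop per sample, normalized pixel-coordinate channels concatenated to the denoiser input, and an L2 loss that ignores the reconstructed coordinate channels. The function names (`random_patch_with_coords`, `patch_denoising_loss`), the denoiser call signature, and the plain L2 weighting are our illustrative assumptions, not the authors' released code; see the linked repository for the actual EDM-based implementation.

```python
import torch

def random_patch_with_coords(images, patch_size):
    # images: (B, C, H, W) tensor. Crop an independent random patch per sample
    # and build matching normalized (x, y) coordinate channels so the denoiser
    # knows where each patch sits in the full image.
    B, C, H, W = images.shape
    patches, coords = [], []
    for img in images:
        top = torch.randint(0, H - patch_size + 1, (1,)).item()
        left = torch.randint(0, W - patch_size + 1, (1,)).item()
        patches.append(img[:, top:top + patch_size, left:left + patch_size])
        ys = torch.linspace(-1.0, 1.0, H)[top:top + patch_size]
        xs = torch.linspace(-1.0, 1.0, W)[left:left + patch_size]
        yy, xx = torch.meshgrid(ys, xs, indexing="ij")
        coords.append(torch.stack([xx, yy]))           # (2, p, p)
    return torch.stack(patches), torch.stack(coords)  # (B, C, p, p), (B, 2, p, p)

def patch_denoising_loss(denoiser, images, sigma, patch_size):
    # Concatenate the coordinate channels to the noisy patch, denoise, and
    # score only the image channels; the reconstructed coordinate channels
    # are ignored, as the quoted setup describes.
    patches, coords = random_patch_with_coords(images, patch_size)
    noisy = patches + sigma * torch.randn_like(patches)
    pred = denoiser(torch.cat([noisy, coords], dim=1), sigma)
    C = patches.shape[1]
    return ((pred[:, :C] - patches) ** 2).mean()
```

The label-dropping and CFG details would translate similarly: replace the class label with a null token for a random 10% of training samples, and at sampling time combine the denoiser outputs in the standard classifier-free guidance form, e.g. `uncond + 1.3 * (cond - uncond)`, assuming the usual CFG convention.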