Patch Diffusion: Faster and More Data-Efficient Training of Diffusion Models
Authors: Zhendong Wang, Yifan Jiang, Huangjie Zheng, Peihao Wang, Pengcheng He, Zhangyang "Atlas" Wang, Weizhu Chen, Mingyuan Zhou
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct five sets of experiments to validate our patch diffusion training method. In the first subsection, we conduct an ablation study on what impacts the performance of our method. In the second subsection, we compare our method with its backbone model and other state-of-the-art diffusion model baselines on commonly-used benchmark datasets. Thirdly, we show that our method could also help improve the efficiency of finetuning large-scale pretrained models. Then, we show that patch diffusion models could achieve better generation quality on typical small datasets. Finally, we evaluate the out-painting capability of patch diffusion models. |
| Researcher Affiliation | Collaboration | Zhendong Wang1,2, Yifan Jiang1, Huangjie Zheng1,2, Peihao Wang1, Pengcheng He2, Zhangyang Wang1, Weizhu Chen2, and Mingyuan Zhou1 1The University of Texas at Austin, 2Microsoft Azure AI |
| Pseudocode | No | The paper describes its methods in prose and equations but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We share our code and pre-trained models at https://github.com/Zhendong-Wang/Patch-Diffusion. |
| Open Datasets | Yes | Datasets. Following previous works [22, 8, 23], we select CelebA (~200k images) [34], FFHQ (70k images) [20], LSUN (~200k images) [63], and ImageNet (~1.2 million images) [9] as our large datasets, and AFHQv2-Cat/Dog/Wild (~5k images in each of them) [7] as our small datasets. |
| Dataset Splits | No | The paper mentions 'full training set used as reference' for FID calculation and 'duration of 200 million images' for training, but it does not explicitly state the specific train/validation/test splits (e.g., percentages or sample counts) used for the experiments. It relies on implied standard splits for benchmark datasets without explicit declaration. |
| Hardware Specification | Yes | We train all models on 16 Nvidia V100 GPUs with a batch size of 512 for a duration of 200 million images. |
| Software Dependencies | No | The paper states 'We implement our Patch Diffusion on top of the current state-of-the-art Unet-based diffusion model EDM-DDPM++ [22] and EDM-ADM [22].' However, it does not specify software versions for programming languages, libraries, or frameworks (e.g., Python version, PyTorch version, CUDA version). |
| Experiment Setup | Yes | We inherit the hyperparameter settings and the UNet architecture from Karras et al. [22]. We implement random cropping independently on each data point in the same batch, and the corresponding pixel coordinates are concatenated with the image channels to form the input for the UNet denoiser. When computing the loss defined in Equation (4), we ignore the reconstructed coordinate channels and only minimize the loss on the image channels. We train all models on 16 Nvidia V100 GPUs with a batch size of 512 for a duration of 200 million images. During training, we randomly drop 10% of the given class labels, and during sampling, we apply classifier-free guidance with strength 1.3 (cfg=1.3). |
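
To make the experiment-setup description above more concrete, below is a minimal PyTorch sketch of the patch-wise input construction it describes: an independent random crop per data point, pixel-coordinate channels concatenated with the image channels, and a denoising loss restricted to the image channels. The function names (`random_patch_with_coords`, `denoising_loss`) and the `denoiser(noisy, sigma)` call signature are illustrative assumptions, not code from the official Patch-Diffusion repository; the actual EDM objective additionally uses per-sample noise levels and preconditioning that are omitted here.

```python
# Hypothetical sketch of per-sample patch cropping with coordinate channels,
# assuming a denoiser callable `denoiser(noisy, sigma)`; not the official code.
import torch
import torch.nn.functional as F


def random_patch_with_coords(images, patch_size):
    """Crop a random patch per sample and append normalized (x, y) coordinate channels."""
    n, c, h, w = images.shape
    # Normalized pixel coordinates over the full image, in [-1, 1].
    ys = torch.linspace(-1, 1, h, device=images.device)
    xs = torch.linspace(-1, 1, w, device=images.device)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
    coords = torch.stack([grid_x, grid_y], dim=0).expand(n, -1, -1, -1)  # (n, 2, h, w)

    patches, patch_coords = [], []
    for i in range(n):  # independent crop for each data point in the batch
        top = torch.randint(0, h - patch_size + 1, (1,)).item()
        left = torch.randint(0, w - patch_size + 1, (1,)).item()
        patches.append(images[i, :, top:top + patch_size, left:left + patch_size])
        patch_coords.append(coords[i, :, top:top + patch_size, left:left + patch_size])
    patches = torch.stack(patches)            # (n, c, p, p)
    patch_coords = torch.stack(patch_coords)  # (n, 2, p, p)
    return torch.cat([patches, patch_coords], dim=1)  # (n, c + 2, p, p)


def denoising_loss(denoiser, images, sigma, patch_size):
    """Denoising loss computed only on the image channels of the patch input."""
    x = random_patch_with_coords(images, patch_size)
    c_img = images.shape[1]
    noisy = x.clone()
    noisy[:, :c_img] = x[:, :c_img] + torch.randn_like(x[:, :c_img]) * sigma
    pred = denoiser(noisy, sigma)  # UNet sees image + coordinate channels
    # Ignore the reconstructed coordinate channels; supervise image channels only.
    return F.mse_loss(pred[:, :c_img], x[:, :c_img])
```

A training step under these assumptions would sample a noise level, draw a batch, and backpropagate `denoising_loss(denoiser, batch, sigma, patch_size)`; the patch size schedule, EDM preconditioning, and classifier-free-guidance label dropout described in the table are left out for brevity.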