Improving Diffusion-Based Image Synthesis with Context Prediction

Authors: Ling Yang, Jingwei Liu, Shenda Hong, Zhilong Zhang, Zhilin Huang, Zheming Cai, Wentao Zhang, Bin Cui

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments are conducted on unconditional image generation, text-to-image generation, and image inpainting tasks. Our CONPREDIFF consistently outperforms previous methods and achieves new SOTA text-to-image generation results on MS-COCO, with a zero-shot FID score of 6.21.
Researcher Affiliation | Academia | 1 Peking University, 2 Tsinghua University
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | These considerations guide our decision not to release the source code or a public demo at this point in time.
Open Datasets | Yes | Regarding unconditional image generation, we choose four popular datasets for evaluation: CelebA-HQ [34], FFHQ [35], LSUN-Church-outdoor [102], and LSUN-Bedrooms [102]. For text-to-image generation, we train the model with LAION [73, 74] and some internal datasets, and conduct evaluations on the MS-COCO dataset with zero-shot FID and CLIP score [25, 59]. (The reported evaluation protocol is illustrated in the metric sketch after this table.)
Dataset Splits | No | The paper does not provide explicit details about train/validation/test dataset splits (percentages or counts) or refer to standard splits with specific citations for all datasets used.
Hardware Specification | Yes | We use the standard Adam optimizer with a learning rate of 0.0001, weight decay of 0.01, and a batch size of 1024 to optimize the base model and two super-resolution models on NVIDIA A100 GPUs, respectively, equipped with multi-scale training technique (6 image scales).
Software Dependencies | No | The paper mentions software components like T5, CLIP, Adam optimizer, U-Net, and Transformer, but does not specify version numbers for any programming languages, libraries, or frameworks used in the experiments.
Experiment Setup | Yes | We use the standard Adam optimizer with a learning rate of 0.0001, weight decay of 0.01, and a batch size of 1024 to optimize the base model and two super-resolution models on NVIDIA A100 GPUs, respectively, equipped with multi-scale training technique (6 image scales). We use T = 250 time steps and apply r = 10 resamplings with jump size j = 10. For unconditional generation tasks, we use the same denoising architecture as LDM [65] for fair comparison. The max channels are 224, and we use T = 2000 time steps, a linear noise schedule, and an initial learning rate of 0.000096. (These hyperparameters are illustrated in the training and resampling sketches after this table.)
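
The Experiment Setup row quotes concrete optimizer and noise-schedule hyperparameters, but no code is released. The PyTorch sketch below only shows how the reported values (Adam, learning rate 0.0001, weight decay 0.01, T = 2000 steps, linear schedule, 224 max channels) could be wired together; the DummyDenoiser placeholder and the beta endpoints are assumptions, not the authors' architecture.

```python
import torch
from torch import nn

# Placeholder standing in for the paper's denoising U-Net; the real ConPreDiff
# backbone is not released, so only the quoted hyperparameters are illustrated.
class DummyDenoiser(nn.Module):
    def __init__(self, max_channels: int = 224):  # "max channels are 224"
        super().__init__()
        self.inp = nn.Conv2d(3, max_channels, kernel_size=3, padding=1)
        self.out = nn.Conv2d(max_channels, 3, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        return self.out(torch.relu(self.inp(x)))  # timestep conditioning omitted

model = DummyDenoiser()

# Optimizer settings quoted in the table: Adam, lr 0.0001, weight decay 0.01
# (the batch size of 1024 would be set on the data loader).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=0.01)

# Linear noise schedule over T = 2000 steps for the unconditional models; the
# beta endpoints (1e-4, 0.02) are standard DDPM defaults, assumed here.
T = 2000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
```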
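The setup also reports r = 10 resamplings with jump size j = 10 over T = 250 sampling steps, which matches the RePaint-style jump schedule commonly used for diffusion inpainting. The helper below sketches that schedule under this assumption; the function name and structure follow the public RePaint formulation, not code from the paper.

```python
def resampling_schedule(t_total: int = 250, jump_len: int = 10, n_resample: int = 10):
    """Build a RePaint-style list of time steps: the reverse process walks down
    from t_total, and every jump_len steps it jumps back up jump_len steps and
    re-denoises, repeated n_resample times (assumed interpretation of r and j)."""
    jumps = {t: n_resample - 1 for t in range(0, t_total - jump_len, jump_len)}
    t, steps = t_total, []
    while t >= 1:
        t -= 1
        steps.append(t)
        if jumps.get(t, 0) > 0:
            jumps[t] -= 1
            for _ in range(jump_len):
                t += 1
                steps.append(t)
    return steps

# With the reported values each 10-step window is re-denoised 10 times.
schedule = resampling_schedule(250, 10, 10)
print(len(schedule), schedule[:15])
```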
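Finally, the paper evaluates text-to-image generation with zero-shot FID and CLIP score on MS-COCO, but its exact metric implementation is not described. The sketch below shows one common way to compute both metrics with torchmetrics (an assumption; the authors may use different implementations, and CLIPScore additionally requires the transformers package). Random tensors stand in for real MS-COCO images and model samples.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.multimodal.clip_score import CLIPScore

# Dummy stand-ins for MS-COCO validation images and generated samples; a real
# zero-shot evaluation would use tens of thousands of caption-conditioned
# samples, not 8.
real_images = torch.randint(0, 256, (8, 3, 256, 256), dtype=torch.uint8)
fake_images = torch.randint(0, 256, (8, 3, 256, 256), dtype=torch.uint8)
captions = ["a person riding a bicycle down a city street"] * 8

# FID between real and generated image sets (Inception-v3 pooled features).
fid = FrechetInceptionDistance(feature=2048)
fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print("FID:", fid.compute().item())

# CLIP score between generated images and their prompts.
clip_score = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
clip_score.update(fake_images, captions)
print("CLIP score:", clip_score.compute().item())
```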