Inner Classifier-Free Guidance and Its Taylor Expansion for Diffusion Models

Authors: Shikun Sun, Longhui Wei, Zhicai Wang, Zixuan Wang, Junliang Xing, Jia Jia, Qi Tian

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The evaluation of the results, presented in Table 1, Table 2, and Table 3, is based on two metrics: the Fréchet Inception Distance (FID) (Heusel et al., 2017) and the CLIP Score (Radford et al., 2021). The FID metric is calculated by comparing 10,000 generated images with the MS-COCO (Lin et al., 2014) validation dataset, measuring the distance between the distribution of generated images and the distribution of the validation dataset. On the other hand, the CLIP Score is computed between the 10,000 generated images and their corresponding captions by the ViT-L/14 model (Radford et al., 2021), reflecting the similarity between the images and the textual descriptions. (See the metric-computation sketch after the table.)
Researcher Affiliation | Collaboration | Tsinghua University; BNRist; Huawei Inc.; University of Science and Technology of China
Pseudocode | Yes | Algorithm 1: Training policy for ICFG; Algorithm 2: Strict sample algorithm for second-order ICFG; Algorithm 3: Non-strict sample algorithm for second-order ICFG. (An illustrative guidance sketch follows the table.)
Open Source Code | No | The paper does not provide a direct link or an explicit statement about the public release of its source code.
Open Datasets | Yes | The FID metric is calculated by comparing 10,000 generated images with the MS-COCO (Lin et al., 2014) validation dataset, measuring the distance between the distribution of generated images and the distribution of the validation dataset.
Dataset Splits | Yes | The FID metric is calculated by comparing 10,000 generated images with the MS-COCO (Lin et al., 2014) validation dataset, measuring the distance between the distribution of generated images and the distribution of the validation dataset. On the other hand, the CLIP Score is computed between the 10,000 generated images and their corresponding captions by the ViT-L/14 model (Radford et al., 2021), reflecting the similarity between the images and the textual descriptions.
Hardware Specification | Yes | We conducted our experiments on an NVIDIA GeForce RTX 3090, using a batch size of 4.
Software Dependencies | No | The paper mentions software components such as PNDM, the Adam optimizer, U-Net, the CLIP model, Stable Diffusion v1.5, and Low-Rank Adaptation (LoRA), but does not specify version numbers, the programming language, or the frameworks used.
Experiment Setup | Yes | The sampling algorithm employed is PNDM (Liu et al., 2022), and the default number of timesteps is 50. (A hedged sampling-configuration sketch follows the table.)
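
Following up on the Research Type row: the sketch below shows one way to compute the two reported metrics with torchmetrics. It is a minimal sketch, assuming torchmetrics (with its CLIP extras) is installed; the tensors, caption list, and helper function are stand-ins, not the authors' evaluation code, and "openai/clip-vit-large-patch14" is the Hugging Face checkpoint matching the ViT-L/14 backbone named in the paper.

```python
# Minimal sketch of the paper's evaluation protocol (FID + CLIP Score).
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.multimodal.clip_score import CLIPScore

fid = FrechetInceptionDistance(feature=2048)
# ViT-L/14 matches the CLIP backbone named in the paper.
clip_score = CLIPScore(model_name_or_path="openai/clip-vit-large-patch14")

def evaluate(generated, captions, real):
    """generated/real: uint8 tensors of shape (N, 3, H, W); captions: list[str]."""
    fid.update(real, real=True)        # accumulate real-image features
    fid.update(generated, real=False)  # accumulate generated-image features
    clip_score.update(generated, captions)
    return fid.compute().item(), clip_score.compute().item()

# Example with random stand-in images; replace with 10,000 MS-COCO
# validation images and the matching generated samples.
fake = torch.randint(0, 256, (8, 3, 299, 299), dtype=torch.uint8)
real = torch.randint(0, 256, (8, 3, 299, 299), dtype=torch.uint8)
caps = ["a photo of a cat"] * 8
print(evaluate(fake, caps, real))
```

In the paper's protocol, each pool would hold 10,000 images: MS-COCO validation images as the real set and model samples conditioned on the corresponding captions as the generated set.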
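
The paper's Algorithms 1-3 are not reproduced in this report. As a point of reference, the sketch below shows standard classifier-free guidance, the first-order scheme that ICFG generalizes, plus a hypothetical second-order correction obtained by quadratic extrapolation along the condition direction with one extra network evaluation. The function names, the model signature, the midpoint choice, and the curvature formula are illustrative assumptions, not the paper's strict/non-strict samplers.

```python
# Illustrative only: classic classifier-free guidance (CFG) plus a
# hypothetical second-order Taylor-style correction; NOT the paper's
# Algorithms 2-3.
import torch

def cfg(eps_cond: torch.Tensor, eps_uncond: torch.Tensor, w: float) -> torch.Tensor:
    """First-order guidance: extrapolate from unconditional toward conditional."""
    return eps_uncond + w * (eps_cond - eps_uncond)

def second_order_sketch(model, x_t, t, c, c_null, w: float) -> torch.Tensor:
    """Hypothetical second-order variant: one extra evaluation at the midpoint
    of the condition path estimates curvature along (c - c_null)."""
    eps_u = model(x_t, t, c_null)   # prediction at the null condition (gamma = 0)
    eps_c = model(x_t, t, c)        # prediction at the full condition (gamma = 1)
    c_mid = 0.5 * (c + c_null)      # midpoint condition embedding (gamma = 0.5)
    eps_m = model(x_t, t, c_mid)    # extra evaluation for the curvature estimate
    # Central-difference second derivative along the condition path (step 0.5).
    curvature = 4.0 * (eps_c - 2.0 * eps_m + eps_u)
    # Quadratic extrapolation of eps(gamma) to gamma = w; the first-order term
    # reduces to plain CFG when the curvature is zero.
    return cfg(eps_c, eps_u, w) + 0.5 * w * (w - 1.0) * curvature
```

The three evaluations pin down a quadratic in the guidance variable, so the extrapolation is exact for condition paths along which the noise prediction is at most second order; this mirrors the Taylor-expansion view named in the paper's title without claiming its exact update rule.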
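
Finally, a hedged reconstruction of the reported sampling setup using Hugging Face diffusers. The checkpoint id, prompt, and guidance scale are assumptions; the PNDM sampler, 50 timesteps, batch size of 4, and single RTX 3090 come from the paper.

```python
import torch
from diffusers import StableDiffusionPipeline, PNDMScheduler

# Stable Diffusion v1.5 as named in the paper; this checkpoint id is an assumption.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
# PNDM is the sampler reported in the paper (also SD v1.5's default scheduler).
pipe.scheduler = PNDMScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")  # the paper reports a single NVIDIA GeForce RTX 3090

images = pipe(
    "a photo of a cat",        # placeholder prompt; the paper uses MS-COCO captions
    num_inference_steps=50,    # default timestep count reported in the paper
    guidance_scale=7.5,        # assumed; guidance strength is the paper's subject
    num_images_per_prompt=4,   # matches the reported batch size of 4
).images
images[0].save("sample.png")
```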