Inner Classifier-Free Guidance and Its Taylor Expansion for Diffusion Models
Authors: Shikun Sun, Longhui Wei, Zhicai Wang, Zixuan Wang, Junliang Xing, Jia Jia, Qi Tian
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The evaluation of the results, presented in Tables 1, 2, and 3, is based on two metrics: the Fréchet Inception Distance (FID) (Heusel et al., 2017) and the CLIP Score (Radford et al., 2021). The FID metric is calculated by comparing 10,000 generated images with the MS-COCO (Lin et al., 2014) validation dataset, measuring the distance between the distribution of generated images and the distribution of the validation dataset. The CLIP Score is computed between the 10,000 generated images and their corresponding captions by the model ViT-L/14 (Radford et al., 2021), reflecting the similarity between the images and the textual descriptions. (A hedged sketch of this metric setup appears after the table.) |
| Researcher Affiliation | Collaboration | 1Tsinghua University, 2BNRist, 3Huawei Inc., 4University of Science and Technology of China |
| Pseudocode | Yes | Algorithm 1 Training policy for ICFG; Algorithm 2 Strict sample algorithm for second-order ICFG; Algorithm 3 Non-strict sample algorithm for second-order ICFG |
| Open Source Code | No | The paper does not provide a direct link or explicit statement about the public release of its source code. |
| Open Datasets | Yes | The FID metric is calculated by comparing 10,000 generated images with the MS-COCO (Lin et al., 2014) validation dataset, measuring the distance between the distribution of generated images and the distribution of the validation dataset. |
| Dataset Splits | Yes | The FID metric is calculated by comparing 10,000 generated images with the MS-COCO (Lin et al., 2014) validation dataset, measuring the distance between the distribution of generated images and the distribution of the validation dataset. |
| Hardware Specification | Yes | We conducted our experiments on an NVIDIA GeForce RTX 3090, using a batch size of 4. |
| Software Dependencies | No | The paper mentions software components such as PNDM, the Adam optimizer, U-Net, the CLIP model, Stable Diffusion v1.5, and Low-Rank Adaptation (LoRA), but does not provide version numbers for them, nor does it specify the programming language or frameworks used. |
| Experiment Setup | Yes | The sampling algorithm employed is PNDM (Liu et al., 2022), and the default number of timesteps is 50. (A hedged sketch of this sampling setup appears after the table.) |
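The evaluation protocol quoted in the table (FID against the MS-COCO validation set, CLIP Score with ViT-L/14) can be approximated with off-the-shelf tooling. Below is a minimal sketch using the torchmetrics implementations; the paper does not name the FID/CLIP implementation it used, so the library choice, the checkpoint identifier, and the stand-in tensors are assumptions rather than details from the paper.

```python
# Hedged sketch: the paper does not name its FID / CLIP Score implementation.
# torchmetrics is used here as one plausible choice; the checkpoint name and
# the random stand-in tensors are assumptions, not details from the paper.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.multimodal.clip_score import CLIPScore

fid = FrechetInceptionDistance(feature=2048)  # Inception-v3 pooled features
clip_score = CLIPScore(model_name_or_path="openai/clip-vit-large-patch14")  # ViT-L/14

# Both metrics expect uint8 image tensors of shape (N, 3, H, W).
# The paper uses 10,000 images; 16 random stand-ins are used here for brevity.
real_images = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)  # MS-COCO val stand-in
fake_images = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)  # generated stand-in
captions = ["a photo of a cat"] * 16  # corresponding captions

fid.update(real_images, real=True)    # accumulate statistics of the validation set
fid.update(fake_images, real=False)   # accumulate statistics of the generated set
print("FID:", fid.compute().item())

clip_score.update(fake_images, captions)  # image-text similarity via ViT-L/14
print("CLIP Score:", clip_score.compute().item())
```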
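The experiment-setup row (PNDM sampling, 50 timesteps, Stable Diffusion v1.5) corresponds to a standard text-to-image pipeline. The sketch below reproduces only that baseline configuration with Hugging Face diffusers; the paper's ICFG modification to classifier-free guidance is not part of diffusers, and the checkpoint id and guidance scale shown are assumptions.

```python
# Hedged sketch of the quoted sampling setup (PNDM, 50 timesteps, Stable
# Diffusion v1.5) using Hugging Face diffusers. This is the baseline
# classifier-free-guidance pipeline only; the paper's ICFG method is not
# implemented here. Checkpoint id and guidance scale are assumptions.
import torch
from diffusers import StableDiffusionPipeline, PNDMScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")  # e.g. an RTX 3090, matching the paper's hardware note

# PNDM is the sampler named in the paper; configure it explicitly.
pipe.scheduler = PNDMScheduler.from_config(pipe.scheduler.config)

# guidance_scale > 1 enables standard classifier-free guidance:
#   eps = eps_uncond + w * (eps_cond - eps_uncond)
image = pipe(
    "a photo of an astronaut riding a horse",  # example prompt, not from the paper
    num_inference_steps=50,  # default timestep count quoted in the table
    guidance_scale=7.5,      # assumed; the paper studies guidance strengths
).images[0]
image.save("sample.png")
```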