InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation
Authors: Xingchao Liu, Xiwen Zhang, Jianzhu Ma, Jian Peng, Qiang Liu
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose a novel text-conditioned pipeline to turn Stable Diffusion (SD) into an ultra-fast one-step model, in which we find reflow plays a critical role in improving the assignment between noises and images. Leveraging our new pipeline, we create, to the best of our knowledge, the first one-step diffusion-based text-to-image generator with SD-level image quality, achieving an FID (Fréchet Inception Distance) of 23.3 on MS COCO 2017-5k, surpassing the previous state-of-the-art technique, progressive distillation [58], by a significant margin (37.2 → 23.3 in FID). (A hedged FID-computation sketch appears after the table.) |
| Researcher Affiliation | Collaboration | Xingchao Liu¹, Xiwen Zhang², Jianzhu Ma², Jian Peng², Qiang Liu¹; ¹Department of Computer Science, University of Texas at Austin; ²Helixon Research |
| Pseudocode | Yes | Algorithm 1: Training Text-Conditioned Rectified Flow from Stable Diffusion. Algorithm 2: Distilling Text-Conditioned k-Rectified Flow for One-Step Generation. (A hedged sketch of these two objectives appears after the table.) |
| Open Source Code | Yes | Codes and pre-trained models are available at github.com/gnobitab/InstaFlow. |
| Open Datasets | Yes | achieving an FID (Fréchet Inception Distance) of 23.3 on MS COCO 2017-5k, surpassing the previous state-of-the-art technique, progressive distillation [58]... On MS COCO 2014-30k, InstaFlow yields an FID of 13.1... In this section, we use the pre-trained Stable Diffusion 1.4 provided in the official open-sourced repository to initialize the weights. In our experiment, we set D_T to be a subset of text prompts from laion2B-en [74], pre-processed by the same filtering as SD. |
| Dataset Splits | No | The paper evaluates on MS COCO test sets and fine-tunes from a pre-trained Stable Diffusion model, but it does not explicitly provide the training/validation/test splits used for its own model training process. |
| Hardware Specification | Yes | On MS COCO 2014-30k, InstaFlow yields an FID of 13.1 in just 0.09 seconds, the best in the 0.1-second regime... Inference time is measured on our machine with an NVIDIA A100 GPU. ... The training of InstaFlow only costs 199 A100 GPU days. ... We use a batch size of 32 and 8 A100 GPUs for training with the AdamW optimizer [48]. |
| Software Dependencies | Yes | We use PyTorch 2.0.1 and Hugging Face Diffusers 0.19.3. |
| Experiment Setup | Yes | We use a batch size of 32 and 8 A100 GPUs for training with the AdamW optimizer [48]. The choice of optimizer follows the default protocol in Hugging Face for fine-tuning SD. For all the models, we train them for 100,000 steps. The guidance scale α for 2-Rectified Flow is set to 1.5. The learning rate for reflow is 10⁻⁶. We warm-up the training process for 1,000 steps in both reflow and distillation. (A hedged training-configuration sketch appears after the table.) |
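
The Pseudocode row above names Algorithms 1 and 2 without reproducing them. Below is a minimal sketch of the reflow and distillation objectives they describe, assuming a trainable velocity network `student_unet(x, t, text_emb)` and a frozen teacher sampler `teacher_sample(noise, text_emb)` that integrates the previous flow from noise to image; both names are placeholders introduced here, not identifiers from the authors' code.

```python
import torch
import torch.nn.functional as F

# Hedged sketch of the reflow (Algorithm 1) and distillation (Algorithm 2) objectives.
# `student_unet` and `teacher_sample` are hypothetical stand-ins for the trainable
# network and the frozen (k-)rectified-flow sampler.

def reflow_loss(student_unet, teacher_sample, text_emb, noise):
    """Reflow: fit a straight-line velocity field on (noise, image) pairs
    generated by the frozen teacher."""
    with torch.no_grad():
        x1 = teacher_sample(noise, text_emb)        # teacher-generated image endpoint
    x0 = noise
    t = torch.rand(x0.shape[0], device=x0.device)   # uniform time in [0, 1]
    t_ = t.view(-1, 1, 1, 1)
    xt = t_ * x1 + (1.0 - t_) * x0                  # linear interpolation between noise and image
    v_pred = student_unet(xt, t, text_emb)          # predicted velocity at (xt, t)
    return F.mse_loss(v_pred, x1 - x0)              # regress onto the straight-line velocity

def distill_loss(student_unet, teacher_sample, text_emb, noise):
    """Distillation: make a single Euler step of the student match the teacher's
    trajectory endpoint. The paper uses a similarity loss for this step; plain MSE
    is used here only as a placeholder."""
    with torch.no_grad():
        x1 = teacher_sample(noise, text_emb)
    t0 = torch.zeros(noise.shape[0], device=noise.device)
    x1_onestep = noise + student_unet(noise, t0, text_emb)   # one Euler step from t = 0 to t = 1
    return F.mse_loss(x1_onestep, x1)
```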
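
The Experiment Setup row quotes concrete hyperparameters (AdamW, learning rate 10⁻⁶, 1,000 warm-up steps, 100,000 training steps, batch size 32). A minimal sketch of how they could be wired together in PyTorch follows; the linear warm-up shape is an assumption, since the paper only states the number of warm-up steps.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Hyperparameters quoted in the Experiment Setup row.
LEARNING_RATE = 1e-6     # learning rate for reflow
TOTAL_STEPS = 100_000    # training steps for all models
WARMUP_STEPS = 1_000     # warm-up steps for both reflow and distillation
BATCH_SIZE = 32          # global batch size across 8 A100 GPUs
GUIDANCE_SCALE = 1.5     # guidance scale alpha for 2-Rectified Flow

def build_optimizer(model: torch.nn.Module):
    """AdamW with a linear warm-up over the first 1,000 steps, then a constant rate."""
    optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)
    scheduler = LambdaLR(
        optimizer,
        lr_lambda=lambda step: min(1.0, (step + 1) / WARMUP_STEPS),
    )
    return optimizer, scheduler
```

Stepping the returned optimizer and scheduler once per batch inside a 100,000-step loop mirrors the quoted setup; the Hugging Face fine-tuning defaults the paper refers to would supply the remaining AdamW settings such as betas and weight decay.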
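
The FID numbers cited above (23.3 on MS COCO 2017-5k, 13.1 on MS COCO 2014-30k) are taken from the paper and are not recomputed in this report. For reference, here is a hedged sketch of how such a score could be computed with torchmetrics, which is a library choice made here and not necessarily the authors' evaluation pipeline.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

def compute_fid(real_images: torch.Tensor, generated_images: torch.Tensor) -> float:
    """FID between reference photos (e.g. MS COCO images) and one-step samples.
    Both tensors are expected as uint8 of shape (N, 3, H, W)."""
    fid = FrechetInceptionDistance(feature=2048)   # standard Inception-v3 pool features
    fid.update(real_images, real=True)
    fid.update(generated_images, real=False)
    return float(fid.compute())
```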