EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models
Authors: Rui Zhao, Hangjie Yuan, Yujie Wei, Shiwei Zhang, Yuchao Gu, Lingmin Ran, Xiang Wang, Jay Zhangjie Wu, David Junhao Zhang, Yingya Zhang, Mike Zheng Shou
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments with extensive data indicate that a model trained on data generated by the advanced model can approximate its generation capability, but this requires large-scale samples of 10 million or more. Experimental results show that this paradigm significantly reduces the required data volume. |
| Researcher Affiliation | Collaboration | Rui Zhao¹, Hangjie Yuan², Yujie Wei²,³, Shiwei Zhang², Yuchao Gu¹, Lingmin Ran¹, Xiang Wang²,⁴, Zhangjie Wu¹, Junhao Zhang¹, Yingya Zhang², Mike Zheng Shou¹; ¹Show Lab, National University of Singapore; ²Alibaba Group; ³Fudan University; ⁴Huazhong University of Science and Technology |
| Pseudocode | No | The paper describes the framework's processes and operations using textual explanations and flowcharts (e.g., Figure 2, Figure 3), but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | The code and model weights are available at https://github.com/showlab/EvolveDirector. |
| Open Datasets | Yes | To explore the first question, we start with a demonstration experiment by training a relatively poor model, a DiT model [10] pre-trained on the public datasets ImageNet [5] and SAM [7], to approach the advanced model PixArt-α [9] using increasing data scales. |
| Dataset Splits | No | The paper describes dynamic data curation and online evaluation with a VLM during training, which serves a similar purpose to validation. However, it does not specify traditional train/validation/test dataset splits with explicit percentages or sample counts for a static validation set. (A hypothetical sketch of such a curation loop follows the table.) |
| Hardware Specification | Yes | We train the base model on 16 A100 GPUs for 240 GPU days, with batch sizes of 128 and 32 for images at 512px and 1024px resolution, respectively. The VLM evaluation process is distributed across 8 A100 GPUs to accelerate it. |
| Software Dependencies | No | The paper mentions using a Diffusion Transformer (DiT) [10] and building upon the architecture of PixArt-α [9], as well as the AdamW optimizer. However, it does not specify version numbers for general software dependencies such as Python, PyTorch, TensorFlow, or CUDA. |
| Experiment Setup | Yes | We train the base model on 16 A100 GPUs for 240 GPU days, with batch sizes of 128 and 32 for images at 512px and 1024px resolution, respectively. The model is trained with the AdamW optimizer, a learning rate of 2e-5, and a gradient clip of 1.0. A constant learning rate schedule with 1000 warm-up steps is used. (A minimal configuration sketch follows the table.) |
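
The reported optimizer settings translate directly into a few lines of PyTorch. Below is a minimal sketch, assuming a standard PyTorch training loop: the model, loss, and step count are hypothetical placeholders, and only the AdamW optimizer, the 2e-5 learning rate, the 1.0 gradient clip, and the constant schedule with 1000 warm-up steps are taken from the paper.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Hypothetical stand-in for the DiT-based text-to-image backbone.
model = torch.nn.Linear(8, 8)

# From the paper: AdamW with a learning rate of 2e-5.
optimizer = AdamW(model.parameters(), lr=2e-5)

WARMUP_STEPS = 1000  # from the paper: 1000 warm-up steps, constant afterwards

def warmup_then_constant(step: int) -> float:
    # Linear warm-up to the base learning rate, then a constant schedule.
    return min(1.0, (step + 1) / WARMUP_STEPS)

scheduler = LambdaLR(optimizer, lr_lambda=warmup_then_constant)

for step in range(2000):  # step count is illustrative, not from the paper
    loss = model(torch.randn(4, 8)).pow(2).mean()  # dummy loss
    loss.backward()
    # From the paper: gradient clipping at 1.0.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```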
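
Since the paper replaces a static validation split with VLM-guided online evaluation and dynamic data curation, the outline below sketches what one round of such a loop could look like. This is an assumption-heavy illustration, not the released implementation: `base_generate`, `advanced_generate`, and `vlm_prefers_advanced` are hypothetical stand-ins for the base model, the advanced model, and the VLM judge.

```python
from typing import Callable, List, Tuple

def curate_step(
    prompts: List[str],
    base_generate: Callable[[str], object],
    advanced_generate: Callable[[str], object],
    vlm_prefers_advanced: Callable[[str, object, object], bool],
    train_set: List[Tuple[str, object]],
) -> None:
    """One hypothetical curation round: keep advanced-model samples only for
    prompts where the VLM judges the base model to still lag behind."""
    for prompt in prompts:
        base_img = base_generate(prompt)
        adv_img = advanced_generate(prompt)
        if vlm_prefers_advanced(prompt, base_img, adv_img):
            # Base model still lags on this prompt: keep the teacher sample.
            train_set.append((prompt, adv_img))
        else:
            # Base model has caught up: prune redundant samples for this prompt.
            train_set[:] = [(p, img) for (p, img) in train_set if p != prompt]
```

The point of such a loop matches the paper's claim: training data is retained only where it still closes a capability gap, which is why far fewer than 10 million samples are needed.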