EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models

Authors: Rui Zhao, Hangjie Yuan, Yujie Wei, Shiwei Zhang, Yuchao Gu, Lingmin Ran, Xiang Wang, Jay Zhangjie Wu, David Junhao Zhang, Yingya Zhang, Mike Zheng Shou

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments with extensive data indicate that a model trained on data generated by the advanced model can approximate its generation capability. However, this requires large-scale data of 10 million samples or more. Experimental results show that this paradigm significantly reduces the required data volume.
Researcher Affiliation | Collaboration | Rui Zhao (1), Hangjie Yuan (2), Yujie Wei (2,3), Shiwei Zhang (2), Yuchao Gu (1), Lingmin Ran (1), Xiang Wang (2,4), Zhangjie Wu (1), Junhao Zhang (1), Yingya Zhang (2), Mike Zheng Shou (1). Affiliations: 1 Show Lab, National University of Singapore; 2 Alibaba Group; 3 Fudan University; 4 Huazhong University of Science and Technology.
Pseudocode | No | The paper describes the framework's processes and operations using textual explanations and flowcharts (e.g., Figure 2, Figure 3), but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | The code and model weights are available at https://github.com/showlab/EvolveDirector.
Open Datasets | Yes | To explore the first question, we start with a demonstration experiment by training a relatively poor model, a DiT model [10] pre-trained on the public datasets ImageNet [5] and SAM [7], to approach the advanced model PixArt-α [9] using increasing data scales.
Dataset Splits | No | The paper describes dynamic data curation and online evaluation with a VLM during training, which serves a purpose similar to validation. However, it does not specify traditional train/validation/test dataset splits with explicit percentages or sample counts for a static validation set.
Hardware Specification | Yes | We train the base model on 16 A100 GPUs for 240 GPU days, with batch sizes of 128 and 32 for images at 512px and 1024px resolution, respectively. The VLM evaluation process is distributed across 8 A100 GPUs to speed it up.
Software Dependencies | No | The paper mentions using a Diffusion Transformer (DiT) [10] and building upon the architecture of PixArt-α [9], as well as the AdamW optimizer. However, it does not specify version numbers for general software dependencies such as Python, PyTorch, TensorFlow, or CUDA.
Experiment Setup | Yes | We train the base model on 16 A100 GPUs for 240 GPU days, with batch sizes of 128 and 32 for images at 512px and 1024px resolution, respectively. The model is trained with the AdamW optimizer with a learning rate of 2e-5 and a gradient clip of 1.0. A constant learning rate schedule with 1000 warm-up steps is used.
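
For context, the reported optimizer and schedule settings translate into a minimal PyTorch sketch as shown below. This is an assumption-laden illustration, not the authors' released code: the model, batch, and loss are hypothetical placeholders standing in for the DiT-based base model and its diffusion objective.

import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Hypothetical placeholder for the DiT-based base model (not the authors' implementation).
model = torch.nn.Linear(1024, 1024)

# Reported settings: AdamW, learning rate 2e-5, constant schedule after 1000 warm-up steps.
optimizer = AdamW(model.parameters(), lr=2e-5)
warmup_steps = 1000
scheduler = LambdaLR(optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps))

def training_step(batch):
    # Stand-in loss; the real objective is the diffusion training loss of the DiT model.
    loss = model(batch).pow(2).mean()
    loss.backward()
    # Reported gradient clip of 1.0.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()

The reported batch sizes (128 at 512px, 32 at 1024px) would be set in the dataloader for the corresponding resolution stage; they are omitted here for brevity.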