Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT

Authors: Le Zhuo, Ruoyi Du, Han Xiao, Yangguang Li, Dongyang Liu, Rongjie Huang, Wenze Liu, Xiangyang Zhu, Fu-Yun Wang, Zhanyu Ma, Xu Luo, Zehan Wang, Kaipeng Zhang, Lirui Zhao, Si Liu, Xiangyu Yue, Wanli Ouyang, Yu Qiao, Hongsheng Li, Peng Gao

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To quantitatively assess the effects of Next-DiT with the above improvements, we conduct experiments on the label-conditional ImageNet-256 benchmark. We follow the training setups and evaluation protocols of SiT [59] and Flag-DiT [35]. As depicted in Figure 5, Next-DiT converges significantly faster than both Flag-DiT and SiT as evaluated by FID and Inception Score (IS).
Researcher Affiliation | Collaboration | 1 Shanghai AI Laboratory; 2 The Chinese University of Hong Kong; 3 HKGAI under InnoHK; 4 Beihang University; 5 Beijing University of Posts and Telecommunications
Pseudocode | Yes | Algorithm B.1 illustrates the pseudocode for combining the midpoint method and sigmoid schedule for sampling (an illustrative sketch appears after this table).
Open Source Code | Yes | By releasing all codes and model weights at https://github.com/Alpha-VLLM/Lumina-T2X, we aim to advance the development of next-generation generative AI capable of universal modeling.
Open Datasets | Yes | To quantitatively assess the effects of Next-DiT with the above improvements, we conduct experiments on the label-conditional ImageNet-256 benchmark.
Dataset Splits | Yes | We follow the training setups and evaluation protocols of SiT [59] and Flag-DiT [35]. We conduct experiments on the ImageNet-1K dataset to validate the effectiveness of Next-DiT for image recognition. We build our Next-DiT model following the architecture hyperparameters of ViT-base [28], stacking 12 transformer layers with a hidden size of 768 and 12 attention heads. This configuration ensures that our architecture has a comparable number of parameters to the original ViT. During the fixed-resolution pre-training stage, we train the models from scratch for 300 epochs with an input size of 224×224.
Hardware Specification | Yes | A100 cost: 16 GPUs × 45 h
Software Dependencies | No | The paper mentions the 'AdamW optimizer' but does not provide specific version numbers for software libraries or frameworks.
Experiment Setup | Yes | During the fixed-resolution pre-training stage, we train the models from scratch for 300 epochs with an input size of 224×224. We use the AdamW optimizer with a cosine decay learning rate scheduler, setting the initial learning rate, weight decay, and batch size to 1e-3, 0.05, and 1024, respectively (a configuration sketch follows below).
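
The Pseudocode row above points to Algorithm B.1, which combines the explicit midpoint method (a second-order ODE solver) with a sigmoid time schedule for sampling. The following is a minimal, illustrative sketch of that combination, not the authors' Algorithm B.1: the exact sigmoid parameterization, the `velocity_model` interface, and the default `num_steps`/`scale` values are assumptions.

```python
import torch

def sigmoid_timesteps(num_steps: int, scale: float = 6.0) -> torch.Tensor:
    """Assumed sigmoid time schedule: map a uniform grid through a sigmoid,
    then renormalize so the schedule exactly spans [0, 1]."""
    u = torch.linspace(-scale, scale, num_steps + 1)
    t = torch.sigmoid(u)
    return (t - t[0]) / (t[-1] - t[0])

@torch.no_grad()
def midpoint_sample(velocity_model, x, num_steps: int = 30, scale: float = 6.0):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (data) with the
    explicit midpoint method over a sigmoid-shaped time grid.

    `velocity_model(x, t)` is a hypothetical callable returning the predicted
    velocity field for batch `x` at time `t` (a per-sample time tensor).
    """
    t_grid = sigmoid_timesteps(num_steps, scale).to(x.device)
    for i in range(num_steps):
        t0, t1 = t_grid[i], t_grid[i + 1]
        dt = t1 - t0
        # Half step to the midpoint, then a full step using the midpoint velocity.
        v0 = velocity_model(x, t0.expand(x.shape[0]))
        x_mid = x + 0.5 * dt * v0
        v_mid = velocity_model(x_mid, (t0 + 0.5 * dt).expand(x.shape[0]))
        x = x + dt * v_mid
    return x
```

The midpoint rule uses two model evaluations per step, so `num_steps` steps cost `2 * num_steps` network calls. In this sketch the sigmoid grid places small steps near t≈0 and t≈1 and larger steps in between; the schedule parameters used in the paper may differ.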
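
The Dataset Splits and Experiment Setup rows describe the ImageNet-1K recognition configuration: a ViT-base-scale backbone (12 layers, hidden size 768, 12 attention heads) trained from scratch for 300 epochs at 224×224 with AdamW (learning rate 1e-3, weight decay 0.05, batch size 1024) and cosine learning-rate decay. The PyTorch sketch below wires up those quoted hyperparameters; the `TinyViTLikeClassifier` stand-in, the patch size of 16, and per-epoch scheduler stepping are assumptions for illustration, not the released Next-DiT code.

```python
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

class TinyViTLikeClassifier(nn.Module):
    """Hypothetical stand-in at the quoted ViT-base scale:
    12 transformer layers, hidden size 768, 12 attention heads, 1000 classes."""
    def __init__(self, depth=12, dim=768, heads=12, num_classes=1000, patch=16):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # 224x224 -> 14x14 tokens
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        tokens = self.patchify(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        tokens = self.blocks(tokens)
        return self.head(tokens.mean(dim=1))  # mean-pool tokens, then classify

model = TinyViTLikeClassifier()

# Quoted hyperparameters: AdamW, lr 1e-3, weight decay 0.05, batch size 1024,
# 300 epochs at 224x224, cosine-decay learning-rate schedule.
epochs, batch_size = 300, 1024  # batch_size would be used when building the DataLoader
optimizer = AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)
scheduler = CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(epochs):
    # ... one training pass over ImageNet-1K with 224x224 crops goes here ...
    scheduler.step()  # decay the learning rate once per epoch
```

The quoted text does not mention a warmup phase, so none is added here; any such detail should be taken from the released code rather than this sketch.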