Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT
Authors: Le Zhuo, Ruoyi Du, Han Xiao, Yangguang Li, Dongyang Liu, Rongjie Huang, Wenze Liu, Xiangyang Zhu, Fu-Yun Wang, Zhanyu Ma, Xu Luo, Zehan Wang, Kaipeng Zhang, Lirui Zhao, Si Liu, Xiangyu Yue, Wanli Ouyang, Yu Qiao, Hongsheng Li, Peng Gao
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To quantitatively assess the effects of Next-DiT with the above improvements, we conduct experiments on the label-conditional ImageNet-256 benchmark. We follow the training setups and evaluation protocols of SiT [59] and Flag-DiT [35]. As depicted in Figure 5, Next-DiT converges significantly faster than both Flag-DiT and SiT evaluated by FID and Inception Score (IS). |
| Researcher Affiliation | Collaboration | 1 Shanghai AI Laboratory, 2 The Chinese University of Hong Kong, 3 HKGAI under InnoHK, 4 Beihang University, 5 Beijing University of Posts and Telecommunications |
| Pseudocode | Yes | Algorithm B.1 illustrates the pseudocode for combining the midpoint method and sigmoid schedule for sampling. (A hedged sketch of such a sampler is given after the table.) |
| Open Source Code | Yes | By releasing all codes and model weights at https://github.com/Alpha-VLLM/Lumina-T2X, we aim to advance the development of next-generation generative AI capable of universal modeling. |
| Open Datasets | Yes | To quantitatively assess the effects of Next-DiT with the above improvements, we conduct experiments on the label-conditional ImageNet-256 benchmark. |
| Dataset Splits | Yes | We follow the training setups and evaluation protocols of SiT [59] and Flag-DiT [35]. We conduct experiments on the ImageNet-1K dataset to validate the effectiveness of Next-DiT for image recognition. We build our Next-DiT model following the architecture hyperparameters of ViT-base [28], stacking 12 transformer layers with a hidden size of 768 and 12 attention heads. This configuration ensures that our architecture has a comparable number of parameters to the original ViT. During the fixed-resolution pre-training stage, we train the models from scratch for 300 epochs with an input size of 224×224. (A configuration sketch of these hyperparameters follows the table.) |
| Hardware Specification | Yes | A100 cost: 16 GPUs × 45h |
| Software Dependencies | No | The paper mentions 'AdamW optimizer' but does not provide specific version numbers for software libraries or frameworks. |
| Experiment Setup | Yes | During the fixed-resolution pre-training stage, we train the models from scratch for 300 epochs with an input size of 224×224. We use the AdamW optimizer with a cosine decay learning rate scheduler, setting the initial learning rate, weight decay, and batch size to 1e-3, 0.05, and 1024, respectively. (A minimal optimizer/scheduler sketch follows the table.) |
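The Pseudocode row refers to a sampler that combines the midpoint (RK2) ODE solver with a sigmoid time schedule. Below is a minimal PyTorch sketch of that idea, assuming a generic velocity-field model `velocity_model(x, t)` and a sigmoid-spaced time grid whose `scale` parameter is illustrative; the exact schedule and step counts in Algorithm B.1 of the paper may differ.

```python
import torch


def sigmoid_time_grid(num_steps: int, scale: float = 6.0) -> torch.Tensor:
    """Hypothetical sigmoid-spaced time grid on [0, 1].

    Steps are packed more densely near the two ends of the trajectory;
    `scale` controls how strongly the sigmoid warps a uniform grid.
    """
    u = torch.linspace(-scale, scale, num_steps + 1)
    t = torch.sigmoid(u)
    # Rescale so the grid spans exactly [0, 1].
    return (t - t[0]) / (t[-1] - t[0])


@torch.no_grad()
def sample_midpoint(velocity_model, x, num_steps: int = 30, scale: float = 6.0):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (data) with the
    explicit midpoint method on a sigmoid-spaced time grid.

    `velocity_model(x, t)` is assumed to return the predicted velocity for a
    batch `x` at time `t`; this is an illustrative sketch, not the released
    Lumina-Next sampler.
    """
    t_grid = sigmoid_time_grid(num_steps, scale).to(x.device)
    for i in range(num_steps):
        t0, t1 = t_grid[i], t_grid[i + 1]
        dt = t1 - t0
        v0 = velocity_model(x, t0)                    # slope at the start of the step
        x_mid = x + 0.5 * dt * v0                     # half step forward
        v_mid = velocity_model(x_mid, t0 + 0.5 * dt)  # slope at the midpoint
        x = x + dt * v_mid                            # full step using the midpoint slope
    return x
```

Each midpoint step costs two model evaluations but is second-order accurate, which is why it typically needs fewer sampling steps than a first-order Euler solver.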
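For the recognition experiments quoted in the Dataset Splits row, the stated backbone hyperparameters (ViT-base scale: 12 layers, hidden size 768, 12 heads, 224×224 inputs) can be collected in a small configuration object. This is a hypothetical container for reference only; the field names are not taken from the released Lumina-T2X code.

```python
from dataclasses import dataclass


@dataclass
class NextDiTBaseConfig:
    """Illustrative container for the quoted recognition-experiment hyperparameters."""
    depth: int = 12          # transformer layers
    hidden_size: int = 768   # embedding dimension
    num_heads: int = 12      # attention heads
    image_size: int = 224    # fixed-resolution pre-training input size


cfg = NextDiTBaseConfig()
print(cfg)
```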
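The Experiment Setup row fully specifies the optimizer side of the training recipe. The sketch below wires those values (AdamW, cosine decay, learning rate 1e-3, weight decay 0.05, batch size 1024, 300 epochs) in plain PyTorch; the model and data loop are placeholders, and details such as warmup are not stated in the quoted text and are therefore omitted.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

# Placeholder module standing in for the Next-DiT recognition backbone.
model = torch.nn.Linear(768, 1000)

epochs = 300          # fixed-resolution pre-training with 224x224 inputs
batch_size = 1024     # quoted batch size (used when building the data loader)
base_lr = 1e-3
weight_decay = 0.05

optimizer = AdamW(model.parameters(), lr=base_lr, weight_decay=weight_decay)
# Cosine decay of the learning rate over the whole run, stepped once per epoch.
scheduler = CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(epochs):
    # ... one pass over ImageNet-1K in batches of `batch_size` 224x224 crops ...
    scheduler.step()
```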