Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT
Authors: Le Zhuo, Ruoyi Du, Han Xiao, Yangguang Li, Dongyang Liu, Rongjie Huang, Wenze Liu, Xiangyang Zhu, Fu-Yun Wang, Zhanyu Ma, Xu Luo, Zehan Wang, Kaipeng Zhang, Lirui Zhao, Si Liu, Xiangyu Yue, Wanli Ouyang, Yu Qiao, Hongsheng Li, Peng Gao
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To quantitatively assess the effects of Next-DiT with the above improvements, we conduct experiments on the label-conditional ImageNet-256 benchmark. We follow the training setups and evaluation protocols of SiT [59] and Flag-DiT [35]. As depicted in Figure 5, Next-DiT converges significantly faster than both Flag-DiT and SiT evaluated by FID and Inception Score (IS). |
| Researcher Affiliation | Collaboration | 1 Shanghai AI Laboratory, 2 The Chinese University of Hong Kong, 3 HKGAI under InnoHK, 4 Beihang University, 5 Beijing University of Posts and Telecommunications |
| Pseudocode | Yes | Algorithm B.1 illustrates the pseudocode for combining the midpoint method and sigmoid schedule for sampling. (A hedged sketch of such a sampler is given after the table.) |
| Open Source Code | Yes | By releasing all codes and model weights at https://github.com/Alpha-VLLM/Lumina-T2X, we aim to advance the development of next-generation generative AI capable of universal modeling. |
| Open Datasets | Yes | To quantitatively assess the effects of Next-DiT with the above improvements, we conduct experiments on the label-conditional ImageNet-256 benchmark. |
| Dataset Splits | Yes | We follow the training setups and evaluation protocols of SiT [59] and Flag-DiT [35]. We conduct experiments on the ImageNet-1K dataset to validate the effectiveness of Next-DiT for image recognition. We build our Next-DiT model following the architecture hyperparameters of ViT-base [28], stacking 12 transformer layers with a hidden size of 768 and 12 attention heads. This configuration ensures that our architecture has a comparable number of parameters to the original ViT. During the fixed-resolution pre-training stage, we train the models from scratch for 300 epochs with an input size of 224×224. (A configuration sketch of these hyperparameters follows the table.) |
| Hardware Specification | Yes | A100 cost: 16 GPUs × 45h |
| Software Dependencies | No | The paper mentions 'AdamW optimizer' but does not provide specific version numbers for software libraries or frameworks. |
| Experiment Setup | Yes | During the fixed-resolution pre-training stage, we train the models from scratch for 300 epochs with an input size of 224×224. We use the AdamW optimizer with a cosine decay learning rate scheduler, setting the initial learning rate, weight decay, and batch size to 1e-3, 0.05, and 1024, respectively. (A minimal optimizer/scheduler sketch follows the table.) |
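The Pseudocode row refers to a sampler that combines the midpoint (RK2) ODE solver with a sigmoid time schedule. Below is a minimal PyTorch sketch of that idea, assuming a generic velocity-field model `velocity_model(x, t)` and a sigmoid-spaced time grid whose `scale` parameter is illustrative; the exact schedule and step counts in Algorithm B.1 of the paper may differ.

```python
import torch


def sigmoid_time_grid(num_steps: int, scale: float = 6.0) -> torch.Tensor:
    """Hypothetical sigmoid-spaced time grid on [0, 1].

    Steps are packed more densely near the two ends of the trajectory;
    `scale` controls how strongly the sigmoid warps a uniform grid.
    """
    u = torch.linspace(-scale, scale, num_steps + 1)
    t = torch.sigmoid(u)
    # Rescale so the grid spans exactly [0, 1].
    return (t - t[0]) / (t[-1] - t[0])


@torch.no_grad()
def sample_midpoint(velocity_model, x, num_steps: int = 30, scale: float = 6.0):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (data) with the
    explicit midpoint method on a sigmoid-spaced time grid.

    `velocity_model(x, t)` is assumed to return the predicted velocity for a
    batch `x` at time `t`; this is an illustrative sketch, not the released
    Lumina-Next sampler.
    """
    t_grid = sigmoid_time_grid(num_steps, scale).to(x.device)
    for i in range(num_steps):
        t0, t1 = t_grid[i], t_grid[i + 1]
        dt = t1 - t0
        v0 = velocity_model(x, t0)                    # slope at the start of the step
        x_mid = x + 0.5 * dt * v0                     # half step forward
        v_mid = velocity_model(x_mid, t0 + 0.5 * dt)  # slope at the midpoint
        x = x + dt * v_mid                            # full step using the midpoint slope
    return x
```

Each midpoint step costs two model evaluations but is second-order accurate, which is why it typically needs fewer sampling steps than a first-order Euler solver.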
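For the recognition experiments quoted in the Dataset Splits row, the stated backbone hyperparameters (ViT-base scale: 12 layers, hidden size 768, 12 heads, 224×224 inputs) can be collected in a small configuration object. This is a hypothetical container for reference only; the field names are not taken from the released Lumina-T2X code.

```python
from dataclasses import dataclass


@dataclass
class NextDiTBaseConfig:
    """Illustrative container for the quoted recognition-experiment hyperparameters."""
    depth: int = 12          # transformer layers
    hidden_size: int = 768   # embedding dimension
    num_heads: int = 12      # attention heads
    image_size: int = 224    # fixed-resolution pre-training input size


cfg = NextDiTBaseConfig()
print(cfg)
```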
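The Experiment Setup row fully specifies the optimizer side of the training recipe. The sketch below wires those values (AdamW, cosine decay, learning rate 1e-3, weight decay 0.05, batch size 1024, 300 epochs) in plain PyTorch; the model and data loop are placeholders, and details such as warmup are not stated in the quoted text and are therefore omitted.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

# Placeholder module standing in for the Next-DiT recognition backbone.
model = torch.nn.Linear(768, 1000)

epochs = 300          # fixed-resolution pre-training with 224x224 inputs
batch_size = 1024     # quoted batch size (used when building the data loader)
base_lr = 1e-3
weight_decay = 0.05

optimizer = AdamW(model.parameters(), lr=base_lr, weight_decay=weight_decay)
# Cosine decay of the learning rate over the whole run, stepped once per epoch.
scheduler = CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(epochs):
    # ... one pass over ImageNet-1K in batches of `batch_size` 224x224 crops ...
    scheduler.step()
```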