Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

InfinityStar: Unified Spacetime AutoRegressive Modeling for Visual Generation

Authors: Jinlai Liu, Jian Han, Bin Yan, Hui Wu, Fengda Zhu, Xing Wang, Yi Jiang, Bingyue Peng, Zehuan Yuan

NeurIPS 2025 | Venue PDF | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments, InfinityStar scores 83.74 on VBench, outperforming all autoregressive models by large margins, even surpassing diffusion competitors like HunyuanVideo. Without extra optimizations, our model generates a 5s, 720p video approximately 10× faster than leading diffusion-based methods. |
| Researcher Affiliation | Industry | EMAIL, EMAIL |
| Pseudocode | No | The paper describes the architecture and methods in text and uses diagrams like Figure 1 and Figure 2 to illustrate concepts, but these are not pseudocode or algorithm blocks. There is no section labeled 'Pseudocode' or 'Algorithm'. |
| Open Source Code | Yes | Codes and models: https://github.com/FoundationVision/InfinityStar |
| Open Datasets | Yes | They are mainly from Panda-70M [6], Mira [14], and other internal video-text pairs. |
| Dataset Splits | No | The paper describes datasets for pre-training and fine-tuning at different resolutions and iterations, implying different training stages. However, it does not provide explicit train/validation/test splits (e.g., percentages or counts) for evaluation on any single dataset. The evaluation is performed on the VBench benchmark. |
| Hardware Specification | No | We use internal clusters to train our models. In terms of estimation, the four-stage video generation process consumes 5,000, 40,000, 30,000, and 30,000 GPU hours respectively. The video VAE requires 2,000 GPU hours. The ablation study has a total cost of 10,000 GPU hours, and the evaluation consumes 1,000 GPU hours. |
| Software Dependencies | No | The paper does not mention specific software names with version numbers, such as Python, PyTorch, or CUDA versions, which are needed to replicate the experiment. |
| Experiment Setup | Yes | The autoregressive Transformer of InfinityStar is trained progressively in four stages, including a T2I pre-training and three T2V fine-tuning on 192p, 480p, 720p respectively. Each time we increase the training resolution, we preserve scale schedule of lower resolutions and append several larger scales, which enables better inheritance. The global batch size for 192p is 2048 and that of 480p and 720p is 1024. The learning rate for 192p is 2e-4. Then we decay it to 1e-4 for 480p and 720p. We train the model on videos of 192p, 480p, 720p for 50K, 8K, 3K iterations, respectively. |
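
The figures quoted in the Experiment Setup and Hardware Specification rows can be collected into a small configuration sketch to make the scale of each stage easier to compare. This is only an illustration: the `Stage` class, the dictionary keys, and the helper functions are hypothetical names, not taken from the InfinityStar codebase. The settings for the T2I pre-training stage are not quoted above, so that stage is omitted from the schedule, and the mapping of the four GPU-hour figures to stages in listed order is an assumption based on the word "respectively".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One T2V fine-tuning stage, using the values quoted in the table."""
    resolution: str
    global_batch: int
    learning_rate: float
    iterations: int


# Values as reported: batch 2048 at 192p, 1024 at 480p and 720p;
# learning rate 2e-4 at 192p, decayed to 1e-4 afterwards.
T2V_STAGES = [
    Stage("192p", 2048, 2e-4, 50_000),
    Stage("480p", 1024, 1e-4, 8_000),
    Stage("720p", 1024, 1e-4, 3_000),
]

# Estimated GPU hours as reported. Assigning the four stage figures in
# this order (T2I, 192p, 480p, 720p) is an assumption.
GPU_HOURS = {
    "t2i_pretraining": 5_000,
    "t2v_192p": 40_000,
    "t2v_480p": 30_000,
    "t2v_720p": 30_000,
    "video_vae": 2_000,
    "ablation_study": 10_000,
    "evaluation": 1_000,
}


def samples_seen(stage: Stage) -> int:
    """Training samples processed in a stage: global batch x iterations."""
    return stage.global_batch * stage.iterations


def total_gpu_hours(hours: dict) -> int:
    """Sum of all reported GPU-hour estimates."""
    return sum(hours.values())


if __name__ == "__main__":
    for stage in T2V_STAGES:
        print(f"{stage.resolution}: lr={stage.learning_rate:g}, "
              f"samples seen={samples_seen(stage):,}")
    print(f"Total reported GPU hours: {total_gpu_hours(GPU_HOURS):,}")
```

Read this way, the 192p stage processes far more samples than the two higher-resolution stages combined, while the reported GPU-hour estimates for the three T2V stages are of similar magnitude. The sample counts are derived quantities, not numbers stated in the paper.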