Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

VideoMAR: Autoregressive Video Generation with Continuous Tokens

Authors: Hu Yu, Biao Gong, Hangjie Yuan, DanDan Zheng, Weilong Chai, Jingdong Chen, Kecheng Zheng, Feng Zhao

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 5 Experiments; 5.1 Implementation Details; 5.2 Quantitative Comparison; 5.3 Qualitative Results; 5.4 Ablations
Researcher Affiliation | Collaboration | Hu Yu¹, Biao Gong², Hangjie Yuan², DanDan Zheng², Weilong Chai², Jingdong Chen², Kecheng Zheng², Feng Zhao¹. ¹MoE Key Lab of BIPC, University of Science and Technology of China; ²Independent researcher
Pseudocode | No | The paper describes its methodology through narrative text and diagrams (Figure 2) rather than explicit pseudocode or algorithm blocks. No section is labeled 'Pseudocode' or 'Algorithm'.
Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: We will soon release the code and checkpoints of our method after submission.
Open Datasets | No | Datasets. For image-to-video training, we employ 0.5M internal video-text pairs.
Dataset Splits | No | The paper mentions using '0.5M internal video-text pairs' for training but does not provide details on how this internal dataset is split into training, validation, and test sets. It also mentions evaluation on VBench-I2V but not the splits for its own training data.
Hardware Specification | Yes | All the weights are trained from scratch with 64 NVIDIA H20 GPUs.
Software Dependencies | No | The paper mentions using the AdamW optimizer, Qwen2-1.5B as a text encoder, and Cosmos-Tokenizer, but does not provide specific version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages.
Experiment Setup | Yes | The VideoMAR backbone consists of 36 transformer layers with a dimension of 1536. We mostly follow MAR [19] for the implementation of token-wise diffusion loss. The denoising MLP consists of 3 blocks with a dimension of 1280. We adopt the masking and diffusion schedulers from MAR [19], using a masking ratio between 0.7 and 1.0 during training, and progressively reducing it from 1.0 to 0 following a cosine schedule with 64 autoregressive steps during inference. In line with common practice [13], we train with a 1000-step noise schedule but default to 100 steps for inference. For the text prompt, following the practice in FAR [35], we employ Qwen2-1.5B [32] as our text encoder and adopt cross attention for text condition injection. For the visual tokenizer, we adopt Cosmos-Tokenizer [1]. For the first stage (256×256 resolution), we employ Cosmos-Tokenizer with 4×8×8 compression in the temporal and spatial dimensions. The temporal short-to-long curriculum learning is arranged with frame length order of (5, 13, 25). For the second stage (480×768 resolution), we employ Cosmos-Tokenizer with 8×16×16 compression. We utilize the AdamW optimizer [20] (β1 = 0.9, β2 = 0.95) with a weight decay of 0.02 and a base learning rate of 1e-4 in all experiments. All the weights are trained from scratch with 64 NVIDIA H20 GPUs.
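
The Experiment Setup row mentions an inference-time masking ratio that falls from 1.0 to 0 along a cosine schedule over 64 autoregressive steps. Since the code is not yet released, the following is only a minimal sketch of what such a schedule could look like. It assumes the ratio at step k of K is cos(π/2 · (k+1)/K), which is one common form of a cosine schedule and may differ from the authors' implementation. The function name `cosine_mask_schedule` and the 1024-token sequence length (for example, a single 32×32 latent frame, since 256/8 = 32 per side) are illustrative choices, not taken from the paper.

```python
import math


def cosine_mask_schedule(seq_len: int, num_steps: int = 64) -> list[int]:
    """Return how many tokens remain masked after each autoregressive step.

    Assumption (not confirmed by the paper): the masked fraction after
    step k (0-indexed) is cos(pi/2 * (k + 1) / num_steps).
    """
    masked_after_step = []
    prev = seq_len  # every token starts masked (ratio 1.0)
    for step in range(num_steps):
        ratio = math.cos(math.pi / 2.0 * (step + 1) / num_steps)
        target = math.floor(seq_len * ratio)
        # Reveal at least one token per step while any remain,
        # and never let the count drop below zero.
        target = max(0, min(prev - 1, target))
        masked_after_step.append(target)
        prev = target
    return masked_after_step


if __name__ == "__main__":
    # Hypothetical sequence length: one 32x32 latent frame.
    schedule = cosine_mask_schedule(seq_len=1024, num_steps=64)
    print("masked after first 3 steps:", schedule[:3])
    print("masked after last 3 steps:", schedule[-3:])
```

With this shape, few tokens are revealed in the early steps and many in the late steps, so the model commits to a small number of tokens before it has much context. The training-time ratio (sampled between 0.7 and 1.0) is a separate mechanism that this sketch does not cover.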