Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
VideoMAR: Autoregressive Video Generation with Continuous Tokens
Authors: Hu Yu, Biao Gong, Hangjie Yuan, DanDan Zheng, Weilong Chai, Jingdong Chen, Kecheng Zheng, Feng Zhao
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 5 Experiments 5.1 Implementation Details 5.2 Quantitative Comparison 5.3 Qualitative Results 5.4 Ablations |
| Researcher Affiliation | Collaboration | Hu Yu¹, Biao Gong², Hangjie Yuan², DanDan Zheng², Weilong Chai², Jingdong Chen², Kecheng Zheng², Feng Zhao¹. ¹ MoE Key Lab of BIPC, University of Science and Technology of China; ² Independent researcher |
| Pseudocode | No | The paper describes its methodology through narrative text and diagrams (Figure 2) rather than explicit pseudocode or algorithm blocks. No section is labeled 'Pseudocode' or 'Algorithm'. |
| Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: We will soon release the code and checkpoints of our method after submission. |
| Open Datasets | No | Datasets. For image-to-video training, we employ 0.5M internal video-text pairs. |
| Dataset Splits | No | The paper mentions using '0.5M internal video-text pairs' for training but does not provide details on how this internal dataset is split into training, validation, and test sets. It also mentions evaluation on VBench-I2V but not the splits for its own training data. |
| Hardware Specification | Yes | All the weights are trained from scratch with 64 NVIDIA H20 GPUs. |
| Software Dependencies | No | The paper mentions using AdamW optimizer, Qwen2-1.5B as a text encoder, and Cosmos-Tokenizer, but does not provide specific version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages. |
| Experiment Setup | Yes | The VideoMAR backbone consists of 36 transformer layers with a dimension of 1536. We mostly follow MAR [19] for the implementation of token-wise diffusion loss. The denoising MLP consists of 3 blocks with a dimension of 1280. We adopt the masking and diffusion schedulers from MAR [19], using a masking ratio between 0.7 and 1.0 during training, and progressively reducing it from 1.0 to 0 following a cosine schedule with 64 autoregressive steps during inference. In line with common practice [13], we train with a 1000-step noise schedule but default to 100 steps for inference. For the text prompt, following the practice in FAR [35], we employ Qwen2-1.5B [32] as our text encoder and adopt cross attention for text condition injection. For the visual tokenizer, we adopt Cosmos-Tokenizer [1]. For the first stage (256×256 resolution), we employ Cosmos-Tokenizer with 4×8×8 compression in the temporal and spatial dimensions. The temporal short-to-long curriculum learning is arranged with frame length order of (5, 13, 25). For the second stage (480×768 resolution), we employ Cosmos-Tokenizer with 8×16×16 compression. We utilize the AdamW optimizer [20] (β1 = 0.9, β2 = 0.95) with a weight decay of 0.02 and a base learning rate of 1e-4 in all experiments. All the weights are trained from scratch with 64 NVIDIA H20 GPUs. |
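
The inference schedule quoted in the Experiment Setup row (masking ratio reduced from 1.0 to 0 along a cosine curve over 64 autoregressive steps) can be illustrated with a minimal sketch. Since the paper's code is not released, the exact formula below is an assumption, modeled on the cosine decay commonly used in masked generative models. The function names `mask_ratio_schedule` and `tokens_to_reveal` and the example token count of 1024 are hypothetical.

```python
import math


def mask_ratio_schedule(num_steps: int = 64) -> list[float]:
    """Fraction of tokens still masked after each autoregressive step.

    Assumed form: cos(pi/2 * (step + 1) / num_steps), which decays from
    just under 1.0 after the first step to 0 after the last step.
    """
    return [
        math.cos(math.pi / 2 * (step + 1) / num_steps)
        for step in range(num_steps)
    ]


def tokens_to_reveal(total_tokens: int, num_steps: int = 64) -> list[int]:
    """Number of tokens newly predicted (unmasked) at each step."""
    ratios = mask_ratio_schedule(num_steps)
    masked_before = total_tokens  # generation starts fully masked (ratio 1.0)
    revealed = []
    for step, ratio in enumerate(ratios):
        if step == num_steps - 1:
            masked_after = 0  # the final step must unmask everything
        else:
            masked_after = min(masked_before, math.floor(total_tokens * ratio))
        revealed.append(masked_before - masked_after)
        masked_before = masked_after
    return revealed


if __name__ == "__main__":
    counts = tokens_to_reveal(total_tokens=1024, num_steps=64)
    print("first steps:", counts[:4])
    print("last steps:", counts[-4:])
    print("total revealed:", sum(counts))
```

Under this assumed schedule, early steps reveal only a few tokens and later steps reveal progressively more, so the most uncertain early predictions are made in small batches.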