Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
FlexVAR: Flexible Visual Autoregressive Modeling without Residual Prediction
Authors: Siyu Jiao, Gengwei Zhang, Yinlong Qian, Jiancheng Huang, Yao Zhao, Humphrey Shi, Lin Ma, Yunchao Wei, Zequn Jie
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our 1.0B model outperforms its VAR counterpart on the ImageNet 256×256 benchmark. Moreover, when zero-shot transfer the image generation process with 13 steps, the performance further improves to 2.08 FID, outperforming state-of-the-art autoregressive models AiM/VAR by 0.25/0.28 FID and popular diffusion models LDM/DiT by 1.52/0.19 FID, respectively. When transferring our 1.0B model to the ImageNet 512×512 benchmark in a zero-shot manner, FlexVAR achieves competitive results compared to the VAR 2.3B model, which is a fully supervised model trained at 512×512 resolution. [...] We conduct ablation studies on various design choices in FlexVAR. |
| Researcher Affiliation | Collaboration | Siyu Jiao¹, Gengwei Zhang², Yinlong Qian³, Jiancheng Huang³, Yao Zhao¹, Humphrey Shi⁴, Lin Ma³, Yunchao Wei¹, Zequn Jie³. ¹Institute of Information Science, Beijing Jiaotong University; ²University of Technology Sydney; ³Meituan; ⁴Georgia Institute of Technology |
| Pseudocode | No | The paper describes the methodology in Section 3 and its subsections, including mathematical formulations and conceptual descriptions of the model components, but it does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code: FlexVAR [...] The code will be released. |
| Open Datasets | Yes | The training is on Open Images [21] with a constant learning rate of 10⁻⁴, AdamW optimizer with β1 = 0.9, β2 = 0.95, weight decay = 0.05, a batch size of 128, and for 20 epochs. [...] FlexVAR is trained on the ImageNet-1K 256×256 using 80GB A100 GPUs. |
| Dataset Splits | Yes | The training is on Open Images [21] with a constant learning rate of 10⁻⁴, AdamW optimizer with β1 = 0.9, β2 = 0.95, weight decay = 0.05, a batch size of 128, and for 20 epochs. [...] FlexVAR is trained on the ImageNet-1K 256×256 using 80GB A100 GPUs. [...] The image reconstruction quality is measured by r-FID, reconstruction-FID on ImageNet validation set. |
| Hardware Specification | Yes | FlexVAR is trained on the ImageNet-1K 256×256 using 80GB A100 GPUs. |
| Software Dependencies | No | The training process employs the AdamW optimizer with β1 = 0.9, β2 = 0.95, and a weight decay rate of 0.05. The learning rate is set to 1e-4, with the training epochs varying between 180 and 350 depending on the model scale. No specific software versions for libraries or programming languages are mentioned. |
| Experiment Setup | Yes | Our scalable VQVAE tokenizer is configured with a downsampling factor of 16 and is initialized with the pre-trained weights from LlamaGen [41], the codebook size is set to 8912, and the latent space dimension is set to 32. [...] The training is on Open Images [21] with a constant learning rate of 10⁻⁴, AdamW optimizer with β1 = 0.9, β2 = 0.95, weight decay = 0.05, a batch size of 128, and for 20 epochs. K is set to 5 by default, indicating that each latent space is randomly sampled into 5 different resolutions. [...] We provide FlexVAR in three scales, with detailed configurations for each scale provided in Tab. 1. FlexVAR is trained on the ImageNet-1K 256×256 using 80GB A100 GPUs. The training process employs the AdamW optimizer with β1 = 0.9, β2 = 0.95, and a weight decay rate of 0.05. The learning rate is set to 1e-4, with the training epochs varying between 180 and 350 depending on the model scale. [...] During training, we randomly sample the scale size in each step to enhance FlexVAR's capability to perceive any scale. Specifically, we set the maximum number of steps to 10, fixing the scale size of the first step to 1×1 and the last step to 16×16 (corresponding to 256×256 input images), and randomly sampling the scale sizes for the intermediate steps. Each step is dropped with a 5% probability, with a maximum of 4 steps being dropped. Thus, the number of steps during training is from 6 to 10. During inference, we use a default of 10 steps: {1, 2, 3, 4, 5, 6, 8, 10, 13, 16} (same as VAR). |
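
The training-time scale sampling quoted in the Experiment Setup row can be illustrated with a minimal Python sketch. This is not the authors' code: the function name `sample_scale_schedule` and its parameters are hypothetical, and several details are assumptions the quoted text does not settle, namely that intermediate scales are drawn uniformly without replacement and kept in increasing order, that the fixed first (1×1) and last (16×16) steps are never dropped, and that the drop cap is enforced by simply stopping once 4 steps have been dropped.

```python
import random

# Default inference schedule quoted from the paper (same as VAR).
DEFAULT_INFERENCE_SCALES = [1, 2, 3, 4, 5, 6, 8, 10, 13, 16]


def sample_scale_schedule(max_steps=10, first=1, last=16,
                          drop_prob=0.05, max_dropped=4, rng=None):
    """Return an increasing list of scale sizes for one training step.

    Hypothetical helper: names and sampling details are assumptions.
    """
    rng = rng or random.Random()

    # ASSUMPTION: intermediate scales are distinct integers drawn uniformly
    # from the open interval (first, last) and sorted in increasing order.
    intermediate = sorted(rng.sample(range(first + 1, last), max_steps - 2))

    # Drop each intermediate step with probability drop_prob, but never
    # more than max_dropped steps in total.
    # ASSUMPTION: the fixed first and last steps are always kept.
    kept = [first]
    dropped = 0
    for scale in intermediate:
        if dropped < max_dropped and rng.random() < drop_prob:
            dropped += 1
            continue
        kept.append(scale)
    kept.append(last)
    return kept


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(sample_scale_schedule(rng=rng))
```

Under these assumptions every sampled schedule has between 6 and 10 steps, which agrees with the range stated in the quoted text; the actual sampling distribution used by the authors may differ.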