Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Autoregressive Models in Vision: A Survey
Authors: Jing Xiong, Gongye Liu, Lun Huang, Chengyue Wu, Taiqiang Wu, Yao Mu, Yuan Yao, Hui Shen, Zhongwei Wan, Jinfa Huang, Chaofan Tao, Shen Yan, Huaxiu Yao, Lingpeng Kong, Hongxia Yang, Mi Zhang, Guillermo Sapiro, Jiebo Luo, Ping Luo, Ngai Wong
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | This survey comprehensively examines the literature on autoregressive models applied to vision. To improve readability for researchers from diverse research backgrounds, we start with preliminary sequence representation and modeling in vision. Next, we divide the fundamental frameworks of visual autoregressive models into three general sub-categories, including pixel-based, token-based, and scale-based models based on the representation strategy. We then explore the interconnections between autoregressive models and other generative models. Furthermore, we present a multifaceted categorization of autoregressive models in computer vision, including image generation, video generation, 3D generation, and multimodal generation. We also elaborate on their applications in diverse domains, including emerging domains such as embodied AI and 3D medical AI, with about 250 related references. Finally, we highlight the current challenges to autoregressive models in vision with suggestions about potential research directions. |
| Researcher Affiliation | Collaboration | (1) The University of Hong Kong; (2) Tsinghua University; (3) Duke University; (4) University of Rochester; (5) The Ohio State University; (6) Bytedance; (7) The University of North Carolina at Chapel Hill; (8) Apple; (9) The Hong Kong Polytechnic University; (10) Princeton University |
| Pseudocode | No | No specific section or figure explicitly labeled 'Pseudocode' or 'Algorithm' was found in the paper. The paper contains mathematical equations and architectural diagrams, but no structured pseudocode blocks. |
| Open Source Code | No | We have also set up a GitHub repository to organize the papers included in this survey at: https://github.com/ChaofanTao/Autoregressive-Models-in-Vision-Survey. |
| Open Datasets | Yes | Comparison of Model Parameters, Resolution, FID, IS, Precision, and Recall across various Types of Generative Models on ImageNet dataset (Deng et al., 2009). [...] Comparison of Model Parameters, FID, CLIP-Score across various Types of Generative Models on MJHQ-30K (Li et al., 2024a) and MS-COCO (Lin et al., 2014) datasets. |
| Dataset Splits | No | The paper is a survey that reviews and compares existing models. It refers to established datasets like ImageNet, MJHQ-30K, and MS-COCO but does not specify dataset splits for any new experiments conducted within this paper, as it primarily analyzes reported results from other works. |
| Hardware Specification | No | The paper is a survey of autoregressive models in vision and does not describe any specific hardware used for conducting original experiments or generating new results. |
| Software Dependencies | No | The paper is a survey and does not describe specific software dependencies or versions used for conducting its own research or experiments. |
| Experiment Setup | No | The paper is a comprehensive survey of autoregressive models in vision and does not describe an experimental setup with specific hyperparameters or system-level training settings, as it does not present new experimental results. |