Exploring Stochastic Autoregressive Image Modeling for Visual Representation

Authors: Yu Qi, Fan Yang, Yousong Zhu, Yufei Liu, Liwei Wu, Rui Zhao, Wei Li

AAAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our method significantly improves the performance of autoregressive image modeling and achieves the best accuracy (83.9%) on the vanilla ViT-Base model among methods using only ImageNet-1K data. Transfer performance in downstream tasks also shows that our model achieves competitive performance.
Researcher Affiliation | Collaboration | 1 Tsinghua University; 2 SenseTime Research; 3 Institute of Automation, Chinese Academy of Sciences; 4 Qing Yuan Research Institute, Shanghai Jiao Tong University, Shanghai, China. qiy20@mails.tsinghua.edu.cn, yousong.zhu@nlpr.ia.ac.cn, liuyufei@tsinghua.edu.cn, {yangfan1,wuliwei,zhaorui,liwei1}@sensetime.com
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/qiy20/SAIM.
Open Datasets | Yes | Our method is pretrained on the popular ImageNet-1k dataset (Russakovsky et al. 2015). We conduct object detection and instance segmentation experiments on the MS COCO dataset (Lin et al. 2014), and semantic segmentation experiments on the ADE20K dataset (Zhou et al. 2019).
Dataset Splits | Yes | The dataset contains 1.28 million training images from 1000 classes and 50,000 validation images.
Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments, such as GPU or CPU models.
Software Dependencies | No | The paper mentions software components and models such as AdamW, Vision Transformer, Mask R-CNN, and UperNet, but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | Training configurations: We use AdamW for optimization, pretraining for 300/800 epochs with a batch size of 2048. We set the base learning rate to 2e-4 with cosine learning rate decay and a 30-epoch warmup, and set the weight decay to 0.05. We do not employ drop path or dropout. A light data augmentation strategy is used: random resized cropping with a scale range of [0.67, 1] and an aspect ratio range of [3/4, 4/3], followed by random flipping and color normalization. (A minimal reproduction sketch follows the table.)
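
The Experiment Setup row above maps onto a standard PyTorch/torchvision training setup. The sketch below is a minimal, non-authoritative reconstruction of that recipe, not the authors' implementation: the 224x224 input size, the ImageNet normalization statistics, and the per-epoch LambdaLR scheduling are assumptions not stated in the table, and `model` stands in for any ViT-Base backbone.

import math
import torch
import torchvision.transforms as T

# Light augmentation as described: random resized crop with scale [0.67, 1]
# and aspect ratio [3/4, 4/3], random horizontal flip, color normalization.
# The 224x224 input size and ImageNet mean/std are assumptions, not stated above.
transform = T.Compose([
    T.RandomResizedCrop(224, scale=(0.67, 1.0), ratio=(3 / 4, 4 / 3)),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

def build_optimizer(model: torch.nn.Module) -> torch.optim.AdamW:
    # AdamW with base learning rate 2e-4 and weight decay 0.05, as reported.
    return torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.05)

def lr_scale(epoch: int, total_epochs: int = 300, warmup_epochs: int = 30) -> float:
    # 30-epoch linear warmup, then cosine decay over the remaining epochs.
    if epoch < warmup_epochs:
        return epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

# Usage: build the optimizer and step the scheduler once per epoch,
# with DataLoader batch size 2048 as reported.
# optimizer = build_optimizer(model)
# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_scale)

Whether the official SAIM code additionally scales the learning rate with batch size is not stated in the table, so the sketch keeps the base value of 2e-4.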