Exploring Stochastic Autoregressive Image Modeling for Visual Representation

Authors: Yu Qi, Fan Yang, Yousong Zhu, Yufei Liu, Liwei Wu, Rui Zhao, Wei Li

AAAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our method significantly improves the performance of autoregressive image modeling and achieves the best accuracy (83.9%) on the vanilla ViT-Base model among methods using only ImageNet-1K data. Transfer performance in downstream tasks also shows that our model achieves competitive performance.
Researcher Affiliation | Collaboration | 1 Tsinghua University; 2 SenseTime Research; 3 Institute of Automation, Chinese Academy of Sciences; 4 Qing Yuan Research Institute, Shanghai Jiao Tong University, Shanghai, China. qiy20@mails.tsinghua.edu.cn, yousong.zhu@nlpr.ia.ac.cn, liuyufei@tsinghua.edu.cn, {yangfan1,wuliwei,zhaorui,liwei1}@sensetime.com
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/qiy20/SAIM.
Open Datasets | Yes | Our method is pretrained on the popular ImageNet-1k dataset (Russakovsky et al. 2015). We conduct object detection and instance segmentation experiments on the MS COCO dataset (Lin et al. 2014), and semantic segmentation experiments on the ADE20K dataset (Zhou et al. 2019).
Dataset Splits | Yes | The dataset contains 1.28 million training images from 1000 classes and 50,000 validation images.
Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments, such as GPU or CPU models.
Software Dependencies | No | The paper mentions software components and models such as AdamW, Vision Transformer, Mask R-CNN, and UperNet, but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | Training configurations: We use AdamW for optimization, pretraining for 300/800 epochs with a batch size of 2048. We set the base learning rate to 2e-4 with cosine learning rate decay and a 30-epoch warmup, and set the weight decay to 0.05. We do not employ drop path or dropout. A light data augmentation strategy is used: random resized cropping with a scale range of [0.67, 1] and an aspect ratio range of [3/4, 4/3], followed by random flipping and color normalization. (A minimal reproduction sketch follows the table.)
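
The Experiment Setup row above maps onto a standard PyTorch/torchvision training setup. The sketch below is a minimal, non-authoritative reconstruction of that recipe, not the authors' implementation: the 224x224 input size, the ImageNet normalization statistics, and the per-epoch LambdaLR scheduling are assumptions not stated in the table, and `model` stands in for any ViT-Base backbone.

import math
import torch
import torchvision.transforms as T

# Light augmentation as described: random resized crop with scale [0.67, 1]
# and aspect ratio [3/4, 4/3], random horizontal flip, color normalization.
# The 224x224 input size and ImageNet mean/std are assumptions, not stated above.
transform = T.Compose([
    T.RandomResizedCrop(224, scale=(0.67, 1.0), ratio=(3 / 4, 4 / 3)),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

def build_optimizer(model: torch.nn.Module) -> torch.optim.AdamW:
    # AdamW with base learning rate 2e-4 and weight decay 0.05, as reported.
    return torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.05)

def lr_scale(epoch: int, total_epochs: int = 300, warmup_epochs: int = 30) -> float:
    # 30-epoch linear warmup, then cosine decay over the remaining epochs.
    if epoch < warmup_epochs:
        return epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

# Usage: build the optimizer and step the scheduler once per epoch,
# with DataLoader batch size 2048 as reported.
# optimizer = build_optimizer(model)
# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_scale)

Whether the official SAIM code additionally scales the learning rate with batch size is not stated in the table, so the sketch keeps the base value of 2e-4.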