Auto-scaling Vision Transformers without Training

Authors: Wuyang Chen, Wei Huang, Xianzhi Du, Xiaodan Song, Zhangyang Wang, Denny Zhou

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our As-ViT achieves strong performance on classification (83.5% top-1 on ImageNet-1k) and detection (52.7% mAP on COCO)." and "Table 5 demonstrates comparisons of our As-ViT to other models. Compared to both previous Transformer-based and CNN-based architectures, As-ViT achieves state-of-the-art performance with a comparable number of parameters and FLOPs."
Researcher Affiliation | Collaboration | 1 University of Texas at Austin; 2 University of Technology Sydney; 3 Google. Contacts: {wuyang.chen,atlaswang}@utexas.edu, weihuang.uts@gmail.com, {xianzhi,xiaodansong,dennyzhou}@google.com
Pseudocode | Yes | "Algorithm 1: Training-free ViT Topology Search" and "Algorithm 2: Training-free Auto-Scaling ViTs" (a hedged sketch of a training-free search loop follows the table)
Open Source Code | Yes | "Our code is available at https://github.com/VITA-Group/AsViT."
Open Datasets | Yes | "Our As-ViT achieves strong performance on classification (83.5% top-1 on ImageNet-1k) and detection (52.7% mAP on COCO)." and "We benchmark our As-ViT on ImageNet-1k (Deng et al., 2009). Object detection is conducted on COCO 2017..."
Dataset Splits | Yes | "Object detection is conducted on COCO 2017 that contains 118,000 training and 5,000 validation images." (a data-loading sketch follows the table)
Hardware Specification | Yes | "the end-to-end model design and scaling process costs only 12 hours on one V100 GPU." and "We set the default image size as 224×224, and use AdamW (Loshchilov & Hutter, 2017) as the optimizer with cosine learning rate decay (Loshchilov & Hutter, 2016). A batch size of 1024, an initial learning rate of 0.001, and a weight decay of 0.05 are adopted."
Software Dependencies | No | "We use TensorFlow and Keras for training implementations and conduct all training on TPUs." The paper mentions software by name but does not provide specific version numbers.
Experiment Setup | Yes | "We set the default image size as 224×224, and use AdamW (Loshchilov & Hutter, 2017) as the optimizer with cosine learning rate decay (Loshchilov & Hutter, 2016). A batch size of 1024, an initial learning rate of 0.001, and a weight decay of 0.05 are adopted." (an optimizer-configuration sketch follows the table)
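
The pseudocode row names Algorithm 1 (Training-free ViT Topology Search). Below is a minimal Python sketch of what such a training-free search loop can look like; the SEARCH_SPACE and proxy_score names are illustrative placeholders introduced here, not the paper's actual search space or training-free measure.

```python
import random

# Illustrative search space (placeholder values, not the paper's actual space).
SEARCH_SPACE = {
    "depth": [12, 16, 20, 24],
    "embed_dim": [192, 256, 384],
    "num_heads": [3, 4, 6],
}

def sample_topology(space):
    """Draw one random topology from the search space."""
    return {name: random.choice(choices) for name, choices in space.items()}

def proxy_score(topology):
    """Stand-in for a training-free measure evaluated on the untrained network.

    A real implementation would score the network at initialization without any
    training; here a random number is returned so the loop runs end to end.
    """
    return random.random()

def search_topology(num_samples=100):
    """Rank randomly sampled topologies by the training-free proxy and keep the best."""
    best_topology, best_score = None, float("-inf")
    for _ in range(num_samples):
        topology = sample_topology(SEARCH_SPACE)
        score = proxy_score(topology)
        if score > best_score:
            best_topology, best_score = topology, score
    return best_topology

if __name__ == "__main__":
    print(search_topology())
```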
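The dataset-splits row quotes the COCO 2017 train/validation counts. The paper does not state its data-loading tooling; one convenient way to obtain those splits for a reproduction attempt is tensorflow_datasets, as sketched below (an assumption about tooling, not the authors' pipeline).

```python
import tensorflow_datasets as tfds

# Assumption: the paper does not specify its data pipeline; TFDS is just one
# source for the COCO 2017 splits quoted above.
(train_ds, val_ds), info = tfds.load(
    "coco/2017",
    split=["train", "validation"],
    with_info=True,
)

print(info.splits["train"].num_examples)       # ~118,000 training images
print(info.splits["validation"].num_examples)  # 5,000 validation images
```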
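The experiment-setup row quotes the optimizer hyperparameters. A sketch of the corresponding Keras configuration is given below, assuming TensorFlow >= 2.11 (where tf.keras.optimizers.AdamW is available); DECAY_STEPS is a placeholder, since the total step count is not quoted here.

```python
import tensorflow as tf

# Hyperparameters quoted in the experiment setup.
IMAGE_SIZE = 224       # input resolution (used by the data pipeline, not shown)
BATCH_SIZE = 1024      # global batch size (used by the data pipeline, not shown)
BASE_LR = 1e-3
WEIGHT_DECAY = 0.05
DECAY_STEPS = 100_000  # placeholder: depends on the number of epochs and dataset size

# Cosine learning-rate decay (Loshchilov & Hutter, 2016).
lr_schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=BASE_LR,
    decay_steps=DECAY_STEPS,
)

# AdamW (Loshchilov & Hutter, 2017); tf.keras.optimizers.AdamW requires TF >= 2.11
# (earlier versions expose an equivalent optimizer via tensorflow_addons).
optimizer = tf.keras.optimizers.AdamW(
    learning_rate=lr_schedule,
    weight_decay=WEIGHT_DECAY,
)
```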