A Simple Romance Between Multi-Exit Vision Transformer and Token Reduction

Authors: Dongyang Liu, Meina Kan, Shiguang Shan, Xilin Chen

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments substantiate both the existence and effectiveness of the newfound chemistry. Comparative assessments also indicate that METR outperforms state-of-the-art token reduction methods on standard benchmarks, especially under aggressive reduction ratios. (Abstract) / In this section, we first validate the existence of the claimed special chemistry between multi-exit architecture and token reduction in Sec. 4.1. We then delve into the design choices of METR and analyze its performance in different scenarios in Sec. 4.2. Finally, we compare METR with existing token reduction methods to prove its effectiveness in Sec. 4.3. Experiments are conducted on the ImageNet (Deng et al., 2009) dataset using DeiT (Touvron et al., 2021) and MAE (He et al., 2022) models. (Section 4)
Researcher Affiliation | Academia | Dongyang Liu 1,2, Meina Kan 1,2, Shiguang Shan 1,2,3, Xilin Chen 1,2 / 1 Key Lab of Intell. Info. Process., Inst. of Comput. Tech., CAS / 2 University of Chinese Academy of Sciences / 3 Peng Cheng Laboratory / {liudongyang21s, kanmeina, sgshan, xlchen}@ict.ac.cn
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any concrete access information for open-source code (e.g., repository links or explicit statements about code release).
Open Datasets | Yes | Experiments are conducted on the ImageNet (Deng et al., 2009) dataset using DeiT (Touvron et al., 2021) and MAE (He et al., 2022) models. (Section 4) / Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248-255, 2009.
Dataset Splits | Yes | Top-1 accuracy (%) on the ImageNet validation set is reported. (Table 3 caption) / We first fine-tune the officially pre-trained DeiT-S model with L_total in Eq. 10 for 30 epochs, without incorporating any token reduction operations. Subsequently, we directly add the token reduction operation to the model at inference time. (Section 4.1.1)
Hardware Specification | Yes | All experiments are conducted on a server with 8 NVIDIA RTX 4090 GPUs. (Appendix A)
Software Dependencies | No | The paper mentions the optimizer (AdamW) and learning rate schedulers, but does not provide version numbers for the software libraries or frameworks used (e.g., PyTorch, Python, or CUDA versions).
Experiment Setup | Yes | The model is then trained with effective batch size 2048, actual learning rate (after proportional scaling w.r.t. effective batch size) 8e-5, minimum learning rate 2e-6, without learning rate warm-up. The cosine learning rate scheduler is adopted. The optimizer is AdamW, with weight decay set to 1e-6. For self-distillation, the temperature for the distillation loss (Hinton et al., 2015) is set to 1, and the weight, namely α in Eq. 10, is set to 1.0; when L_me instead of L_sd is used for ablation studies, the weight for L_me is also 1.0. (Appendix A.1) / For the 100-epoch fine-tuning setting, the learning rate is enlarged to 8e-4, and the other hyperparameters are left unchanged. (Appendix A.1) / For the from-scratch training setting, the model is trained for 300 epochs. Consistent with the original setting of DeiT, the learning rate and minimum learning rate are set to 2e-3 and 1e-5, the number of warm-up epochs is set to 5, and weight decay is set to 0.05. (Appendix A.1)
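
For concreteness, the following is a minimal PyTorch sketch of the 30-epoch fine-tuning recipe quoted in the Experiment Setup row above (Appendix A.1): AdamW with weight decay 1e-6, a cosine schedule decaying from 8e-5 to 2e-6 with no warm-up, and a cross-entropy loss combined with a self-distillation term weighted by α = 1.0 at temperature 1. Since the paper releases no code, the toy model class, the random stand-in data, and the exact form of the self-distillation loss L_sd here are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn.functional as F
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

EPOCHS = 30          # 30-epoch fine-tuning setting
BASE_LR = 8e-5       # actual LR after scaling to effective batch size 2048
MIN_LR = 2e-6        # minimum learning rate; no warm-up is used
WEIGHT_DECAY = 1e-6  # AdamW weight decay
ALPHA = 1.0          # weight of the distillation term (alpha in Eq. 10)
T = 1.0              # distillation temperature (Hinton et al., 2015)


class ToyMultiExit(torch.nn.Module):
    """Stand-in for a multi-exit backbone: an intermediate head and a final head
    on a shared trunk. The real setup fine-tunes a multi-exit DeiT-S instead."""

    def __init__(self, dim=16, num_classes=10):
        super().__init__()
        self.block1 = torch.nn.Linear(dim, dim)
        self.block2 = torch.nn.Linear(dim, dim)
        self.inter_head = torch.nn.Linear(dim, num_classes)
        self.final_head = torch.nn.Linear(dim, num_classes)

    def forward(self, x):
        h1 = torch.relu(self.block1(x))
        h2 = torch.relu(self.block2(h1))
        return self.inter_head(h1), self.final_head(h2)


def self_distillation_loss(student_logits, teacher_logits, t=T):
    """KL divergence between intermediate-exit and final-exit predictions,
    softened by temperature t (an assumed form of L_sd)."""
    log_p = F.log_softmax(student_logits / t, dim=-1)
    q = F.softmax(teacher_logits.detach() / t, dim=-1)
    return F.kl_div(log_p, q, reduction="batchmean") * (t * t)


model = ToyMultiExit()
optimizer = AdamW(model.parameters(), lr=BASE_LR, weight_decay=WEIGHT_DECAY)
scheduler = CosineAnnealingLR(optimizer, T_max=EPOCHS, eta_min=MIN_LR)  # cosine schedule

for epoch in range(EPOCHS):
    # Random tensors stand in for an ImageNet loader with effective batch size 2048.
    images, targets = torch.randn(8, 16), torch.randint(0, 10, (8,))
    inter_logits, final_logits = model(images)
    loss_ce = F.cross_entropy(final_logits, targets)
    loss_sd = self_distillation_loss(inter_logits, final_logits)
    loss = loss_ce + ALPHA * loss_sd  # L_total = L_ce + alpha * L_sd (Eq. 10)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()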