IA-RED$^2$: Interpretability-Aware Redundancy Reduction for Vision Transformers

Authors: Bowen Pan, Rameswar Panda, Yifan Jiang, Zhangyang Wang, Rogerio Feris, Aude Oliva

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We include extensive experiments on both image and video tasks, where our method could deliver up to 1.4× speed-up for state-of-the-art models like DeiT [53] and TimeSformer [3], by only sacrificing less than 0.7% accuracy. We conduct image recognition experiments on the ImageNet-1k classification dataset [31]. For weakly-supervised image segmentation experiments, we adopt the ImageNet-Segmentation dataset [16] to evaluate the heatmaps we generate. Finally, for video action recognition, we conduct our experiments on the Kinetics-400 dataset [7].
Researcher Affiliation | Collaboration | Bowen Pan$^1$, Rameswar Panda$^2$, Yifan Jiang$^3$, Zhangyang Wang$^3$, Rogerio Feris$^2$, Aude Oliva$^{1,2}$; $^1$MIT CSAIL, $^2$MIT-IBM Watson AI Lab, $^3$UT Austin
Pseudocode | Yes | The pseudo-code for the above optimization pipeline can be found in the supplementary materials.
Open Source Code | Yes | Project Page: http://people.csail.mit.edu/bpan/ia-red/.
Open Datasets | Yes | We conduct image recognition experiments on the ImageNet-1k classification dataset [31]. For weakly-supervised image segmentation experiments, we adopt the ImageNet-Segmentation dataset [16] to evaluate the heatmaps we generate. Finally, for video action recognition, we conduct our experiments on the Kinetics-400 dataset [7], which contains 240k training videos and 10k videos for testing across 400 classes.
Dataset Splits | Yes | The performance of our models on ImageNet-1k is measured with the metrics of top-1 and top-5 accuracy rates. We report three metrics: pixel accuracy, mean accuracy (mAcc), and mean IoU (mIoU) to reflect the segmentation performance. We report the clip-1 and video-1 error of video models, which denote the error rates of evaluating the model with a single clip and with the Left-Center-Right three clips, respectively.
Hardware Specification | Yes | We train most of our models using 16 NVIDIA Tesla V100-32GB GPUs. We test the inference speed in terms of frames per second (fps) of each method on a single NVIDIA Tesla V100-32GB GPU with PyTorch 1.7 and CUDA 10.2. In contrast, our method obtains 79.1% top-1 accuracy with an inference speed of 1360 fps. Furthermore, we compare with Linformer [55] and observe that it only gets a top-1 accuracy of 75.7% on ImageNet-1k. These results show the efficacy of IA-RED$^2$ over existing data-dependent sparse transformers in reducing the redundancy of vision transformers.
Software Dependencies | Yes | We test the inference speed in terms of frames per second (fps) of each method on a single NVIDIA Tesla V100-32GB GPU with PyTorch 1.7 and CUDA 10.2.
Experiment Setup | Yes | For the image recognition task, we divide the vision transformer backbone [53] into 3 (D = 3) groups, where each group contains 4 (L = 4) MSA-FFN modules and one multi-head interpreter. We optimize the entire framework for D × 30 epochs: during every 30 epochs, we optimize the multi-head interpreter for 10 epochs and all of the subsequent MSA-FFN modules for 20 epochs. We use a mini-batch size of 32 images per GPU and adopt the Adam [30] optimizer with an initial learning rate of 4e-5, which decays by a cosine strategy [36], to train all our models. For the video understanding task, we set D = 1, i.e., we only select the informative patches at the input level. We train the multi-head interpreter for 5 epochs and then finetune the backbone network for 1 epoch, mainly following the settings listed in the original paper [3]. We use a mini-batch size of 8 video clips per GPU and adopt an SGD optimizer with an initial learning rate of 2.5e-3 in a cosine strategy [36].
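
To make the staged optimization schedule quoted in the Experiment Setup row concrete, the following is a minimal PyTorch sketch. It assumes DeiT-S-like dimensions (384-d tokens, 196 patches) and uses placeholder modules (`MultiHeadInterpreter`, `msa_ffn_block`) with a dummy loss, so it only illustrates the D × 30-epoch alternation between interpreter and backbone updates with Adam and cosine decay, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

D, L = 3, 4          # 3 groups of 4 MSA-FFN blocks (a 12-block DeiT-S backbone)
DIM, HEADS = 384, 6  # DeiT-S token width and head count, assumed for concreteness


class MultiHeadInterpreter(nn.Module):
    """Stand-in policy head: scores each patch token with a keep probability."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.score(tokens)).squeeze(-1)  # (B, N)


def msa_ffn_block(dim: int, heads: int) -> nn.Module:
    # Stand-in for one MSA-FFN module (batch_first requires PyTorch >= 1.9).
    return nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)


interpreters = nn.ModuleList(MultiHeadInterpreter(DIM) for _ in range(D))
groups = nn.ModuleList(
    nn.Sequential(*(msa_ffn_block(DIM, HEADS) for _ in range(L))) for _ in range(D)
)

# D x 30 epochs in total: within each 30-epoch cycle, 10 epochs update the
# multi-head interpreter and 20 epochs update the subsequent MSA-FFN modules.
# Adam, initial lr 4e-5, cosine decay, mini-batch of 32 images per GPU.
for d in range(D):
    downstream = groups[d:]  # all MSA-FFN modules after the d-th interpreter
    for params, epochs in ((interpreters[d].parameters(), 10),
                           (downstream.parameters(), 20)):
        optimizer = torch.optim.Adam(params, lr=4e-5)
        scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
        for _ in range(epochs):
            tokens = torch.randn(32, 196, DIM)            # dummy batch of patch tokens
            keep = interpreters[d](tokens).unsqueeze(-1)  # patch keep scores in [0, 1]
            out = groups[d](tokens * keep)                # soft masking as a stand-in
            loss = out.mean()                             # dummy objective
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.step()
```

In the real pipeline the interpreter's decisions feed a reward that trades accuracy against efficiency; the dummy loss above stands in only so the schedule runs end to end.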
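The Hardware Specification and Software Dependencies rows report throughput in frames per second on a single V100. Below is a generic sketch of how such a number is typically measured (warm-up iterations plus explicit CUDA synchronization); the batch size, iteration counts, and input resolution are arbitrary assumptions, not the authors' benchmarking protocol.

```python
import time
import torch
import torch.nn as nn


@torch.no_grad()
def measure_fps(model: nn.Module, batch_size: int = 32, iters: int = 50,
                device: str = "cuda") -> float:
    """Images per second for `model` on one GPU, excluding warm-up iterations."""
    model = model.to(device).eval()
    images = torch.randn(batch_size, 3, 224, 224, device=device)
    for _ in range(10):          # warm-up: CUDA context creation, cuDNN autotuning
        model(images)
    torch.cuda.synchronize()     # make sure warm-up kernels have finished
    start = time.time()
    for _ in range(iters):
        model(images)
    torch.cuda.synchronize()     # wait for all kernels before stopping the clock
    return batch_size * iters / (time.time() - start)
```

Passing a DeiT-style classifier (e.g., one built with timm) to `measure_fps` yields a comparable images-per-second figure, though the exact value depends on the batch size, resolution, and hardware used.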