Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

IA-RED$^2$: Interpretability-Aware Redundancy Reduction for Vision Transformers

Authors: Bowen Pan, Rameswar Panda, Yifan Jiang, Zhangyang Wang, Rogerio Feris, Aude Oliva

NeurIPS 2021 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We include extensive experiments on both image and video tasks, where our method could deliver up to 1.4× speed-up for state-of-the-art models like DeiT [53] and TimeSformer [3], by only sacrificing less than 0.7% accuracy. We conduct image recognition experiments on the ImageNet-1k classification dataset [31]. For weakly-supervised image segmentation experiments, we adopt the ImageNet-Segmentation dataset [16] to evaluate the heatmaps we generate. Finally, for video action recognition, we conduct our experiments on Kinetics-400 dataset [7].
Researcher Affiliation | Collaboration | Bowen Pan$^1$, Rameswar Panda$^2$, Yifan Jiang$^3$, Zhangyang Wang$^3$, Rogerio Feris$^2$, Aude Oliva$^{1,2}$; $^1$MIT CSAIL, $^2$MIT-IBM Watson AI Lab, $^3$UT Austin
Pseudocode | Yes | The pseudo-code for the above optimization pipeline can be referred in supplementary materials.
Open Source Code | Yes | Project Page: http://people.csail.mit.edu/bpan/ia-red/.
Open Datasets | Yes | We conduct image recognition experiments on the ImageNet-1k classification dataset [31]. For weakly-supervised image segmentation experiments, we adopt the ImageNet-Segmentation dataset [16] to evaluate the heatmaps we generate. Finally, for video action recognition, we conduct our experiments on Kinetics-400 dataset [7], which contains 240k training videos and 10K videos for testing across 400 classes.
Dataset Splits | Yes | The performance of our models on ImageNet-1k is measured with the metrics of top-1 and top-5 accuracy rates. We report three metrics: pixel accuracy, mean accuracy (mAcc), and mean IoU (mIoU) to reflect the segmentation performance. We report the metrics of clip-1 and video-1 error of video models, which denotes the error rate of evaluating the model with the single clip and the Left-Center-Right three clips, respectively.
Hardware Specification | Yes | We train most of our models using 16 NVIDIA Tesla V100-32GB GPUs. We test the inference speed in terms of frames per second (fps) of each method on a single NVIDIA Tesla V100-32GB GPU with PyTorch 1.7 and CUDA 10.2. In contrast, our method obtains 79.1% top-1 accuracy with the inference speed of 1360 fps. Furthermore, we compare with Linformer [55] and observe that it only gets the top-1 accuracy of 75.7% on ImageNet-1k. These results show the efficacy of IA-RED$^2$ over existing data-dependent sparse transformers in reducing the redundancy of vision transformers.
Software Dependencies | Yes | We test the inference speed in terms of frames per second (fps) of each method on a single NVIDIA Tesla V100-32GB GPU with PyTorch 1.7 and CUDA 10.2.
Experiment Setup | Yes | For the image recognition task, we divide the vision transformer backbone [53] into 3 (D = 3) groups, where each group contains 4 (L = 4) MSA-FFN modules and one multi-head interpreter. We optimize the entire framework for D × 30 epochs. During every 30 epochs, we optimize the multi-head interpreter for 10 epochs and all of the subsequent MSA-FFN modules for 20 epochs. We use a mini-batch size of 32 images per GPU and adopt Adam [30] optimizer with an initial learning rate of 4e-5, which decays by cosine strategy [36] to train all our models. For the video understanding task, we set D = 1, i.e., we only select the informative patches at the input level. And we train the multi-head interpreter for 5 epochs and then finetune the backbone network for 1 epoch, mainly following the settings listed in the original paper [3]. We use a mini-batch size of 8 video clips per GPU and adopt an SGD optimizer with an initial learning rate of 2.5e-3 in cosine strategy [36].
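
The image-recognition schedule quoted under Experiment Setup can be sketched in a few lines of Python. This is only an illustration of the quoted numbers, not the authors' code: the function names `build_schedule` and `cosine_lr` are hypothetical, and the sketch assumes that within each 30-epoch block the interpreter phase comes first, that the cosine decay spans all D × 30 epochs, that it decays to zero, and that the rate is updated once per epoch. The quoted text does not state any of these details.

```python
import math


def build_schedule(num_groups=3, epochs_per_group=30, interpreter_epochs=10):
    """List (epoch, group, phase) for D x 30 epochs (D = num_groups).

    Assumption: in each 30-epoch block, the multi-head interpreter is
    optimized for the first 10 epochs and the subsequent MSA-FFN modules
    for the remaining 20.
    """
    schedule = []
    for group in range(num_groups):
        for local_epoch in range(epochs_per_group):
            phase = "interpreter" if local_epoch < interpreter_epochs else "msa_ffn"
            epoch = group * epochs_per_group + local_epoch
            schedule.append((epoch, group, phase))
    return schedule


def cosine_lr(epoch, total_epochs, base_lr=4e-5):
    """Cosine decay from base_lr toward 0 (the floor of 0 is an assumption)."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))


schedule = build_schedule()
total_epochs = len(schedule)

# Print the first epoch of every phase to show how the blocks alternate.
previous_phase = None
for epoch, group, phase in schedule:
    if phase != previous_phase:
        lr = cosine_lr(epoch, total_epochs)
        print(f"epoch {epoch:2d}  group {group}  phase {phase:11s}  lr {lr:.2e}")
        previous_phase = phase
```

For the video setting described in the same row (D = 1, 5 interpreter epochs, then 1 backbone finetuning epoch), the same helper could be called as `build_schedule(num_groups=1, epochs_per_group=6, interpreter_epochs=5)`, with `base_lr=2.5e-3` passed to `cosine_lr`.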