Accelerating Transformers with Spectrum-Preserving Token Merging

Authors: Chau Tran, Duy M. H. Nguyen, Manh-Duy Nguyen, TrungTin Nguyen, Ngan Le, Pengtao Xie, Daniel Sonntag, James Y. Zou, Binh Nguyen, Mathias Niepert

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental findings demonstrate that PITOME saves 40-60% of the FLOPs of the base models while exhibiting superior off-the-shelf performance on image classification (0.5% average performance drop for ViT-MAE-H compared to 2.6% for baselines), image-text retrieval (0.3% average performance drop for CLIP on Flickr30k compared to 4.5% for other methods), and, analogously, visual question answering with LLaVA-7B.
Researcher Affiliation | Academia | Hoai-Chau Tran (1,2), Duy M. H. Nguyen (1,3,4), Duy M. Nguyen (5), TrungTin Nguyen (6), Ngan Le (7), Pengtao Xie (8,9), Daniel Sonntag (1,10), James Zou (11), Binh T. Nguyen (2), Mathias Niepert (3,4). (1) German Research Center for Artificial Intelligence (DFKI), (2) University of Science, VNU-HCM, (3) Max Planck Research School for Intelligent Systems (IMPRS-IS), (4) University of Stuttgart, (5) Dublin City University, (6) University of Queensland, (7) University of Arkansas, (8) MBZUAI, (9) UC San Diego, (10) Oldenburg University, (11) Stanford University.
Pseudocode | Yes | The pseudo-code for our method is provided in Algorithm 1 (Appendix) with complexity analysis.
Open Source Code | Yes | Our implementation is available at this link.
Open Datasets | Yes | We evaluate PITOME on the image-text retrieval task using three different backbone models, CLIP [56], ALBEF [57], and BLIP [58], on two frequently used datasets, Flickr30k [59] and MSCOCO [60].
Dataset Splits | No | The paper provides 'No. Train' and 'No. Test' counts for various datasets in Table 9, such as approximately 1.28 million training images and 50k test images (50 images per class) for ImageNet-1k. However, it does not explicitly provide percentages or counts for a validation split, nor does it detail the methodology for creating these splits (e.g., random seed, stratified splitting) beyond implicitly relying on standard benchmark practices.
Hardware Specification | Yes | Table 4: Inference time of LLaVA-1.5-7B and LLaVA-1.5-13B models when running on five V100 GPUs and five A100 GPUs.
Software Dependencies | No | The paper mentions software components such as the lmms_eval library [75], PyTorch (in the complexity analysis), and specific models like BERT and CLIP. However, it does not provide version numbers for these libraries or frameworks in the main content of the paper, which are necessary for a reproducible setup.
Experiment Setup | Yes | In experiments, we set α = 1.0 and m = 0.9 − 0.9 · l_i / l, where l_i is the current layer index and l is the total number of encoder layers, so the margin shrinks as tokens move to deeper layers (a minimal sketch of this schedule follows the table).
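
As a concrete reading of the quoted experiment setup, the snippet below computes the per-layer margin m for an encoder. This is a minimal sketch, not the authors' implementation: the helper name margin_for_layer, the 12-layer example, and the reconstructed minus sign in m = 0.9 − 0.9 · l_i / l are assumptions based only on the setup quoted above (α = 1.0 and the margin schedule).

```python
# Minimal sketch (not the official PITOME code): per-layer margin schedule
# reconstructed from the quoted experiment setup, m = 0.9 - 0.9 * l_i / l.

ALPHA = 1.0  # value of alpha quoted in the experiment setup


def margin_for_layer(layer_idx: int, num_layers: int) -> float:
    """Margin for encoder layer `layer_idx` (0-based) out of `num_layers`."""
    return 0.9 - 0.9 * (layer_idx / num_layers)


if __name__ == "__main__":
    # Example: a hypothetical 12-layer ViT-style encoder.
    num_layers = 12
    margins = [margin_for_layer(i, num_layers) for i in range(num_layers)]
    # The margin starts at 0.9 in the first layer and shrinks toward 0
    # in the deepest layers.
    print([round(m, 3) for m in margins])
```

Under this reading, the schedule only sets a per-layer threshold; how the margin and α enter the energy score and the merging step is defined in the paper's Algorithm 1, not reproduced here.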