Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Towards Interpretable and Efficient Attention: Compressing All by Contracting a Few
Authors: Qishuai Wen, Zhiyuan Huang, Chun-Guang Li
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments to demonstrate comparable performance and superior advantages over black-box attention mechanisms on visual tasks. Our work sheds light on the integration of interpretability and efficiency, as well as the unified formula of attention mechanisms. Code is available at this https URL. 1 Introduction. Attention mechanisms have been widely applied across diverse areas, including computer vision [1, 2], natural language processing [3, 4], and scientific discovery [5]. |
| Researcher Affiliation | Academia | Qishuai Wen, Zhiyuan Huang, and Chun-Guang Li School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, P.R. China EMAIL |
| Pseudocode | Yes | D PyTorch implementation. Algorithm 1: PyTorch implementation of CBSA (8) |
| Open Source Code | Yes | Code is available at this https URL. |
| Open Datasets | Yes | We pretrain CBT models on the ImageNet-1k dataset, and finetune them on several downstream datasets. The top-1 accuracy on validation sets is reported in Table 3. In particular, our CBT-Small achieves comparable top-1 accuracy to ViT-S using only 30% of the parameters and 40% of the FLOPs. [...] We evaluate the performance of them on the ADE20K dataset [53] and show the results in the left panel of Fig. 8. |
| Dataset Splits | Yes | We pretrain CBT models on the ImageNet-1k dataset, and finetune them on several downstream datasets. The top-1 accuracy on validation sets is reported in Table 3. [...] We evaluate the zero-shot segmentation performance of ImageNet-1K pretrained models on the PASCAL VOC12 validation set [66]. |
| Hardware Specification | No | The paper discusses FLOPs and throughput comparisons but does not explicitly state the specific GPU or CPU models used for their experiments within the paper text. |
| Software Dependencies | No | Appendix D mentions "PyTorch implementation" and Appendix C mentions "Lion optimizer [64]" and "AdamW optimizer [65]", but no specific version numbers are provided for PyTorch or the optimizers themselves. |
| Experiment Setup | Yes | C More experimental results. Training setup. We train the CBT models in Table 3 and all models in Table 4 for 150 epochs with the Lion optimizer [64]. The learning rate is 2.0 × 10⁻⁴, the weight decay coefficient is 0.05, and the batch size is 256. We also incorporate a warm-up strategy over the first 20 epochs. For data augmentation, we adopt a rather simple choice: just random cropping and random horizontal flipping. We apply label smoothing with a smoothing coefficient of 0.1. For fine-tuning, we use the AdamW optimizer [65], a learning rate of 5 × 10⁻⁵, weight decay of 0.01, and batch size 64. The settings above are largely inherited from [23]. |
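
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch. This is not the authors' code: the names `TrainConfig`, `PRETRAIN`, `FINETUNE`, and `warmup_lr` are hypothetical, the negative exponents on the learning rates are read from garbled source text, and the linear warm-up shape is an assumption, since the excerpt only says that a warm-up is applied over the first 20 epochs. The number of fine-tuning epochs is not given in the excerpt and is left unset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the hyperparameters quoted in the table."""
    optimizer: str
    lr: float
    weight_decay: float
    batch_size: int
    epochs: Optional[int] = None   # None: not stated in the quoted excerpt
    warmup_epochs: int = 0
    label_smoothing: float = 0.0


# Pretraining settings as quoted (Lion optimizer, 150 epochs, 20 warm-up epochs).
PRETRAIN = TrainConfig(
    optimizer="lion",
    lr=2.0e-4,            # exponent sign assumed; the source text is garbled
    weight_decay=0.05,
    batch_size=256,
    epochs=150,
    warmup_epochs=20,
    label_smoothing=0.1,
)

# Fine-tuning settings as quoted (AdamW optimizer); epoch count is not given.
FINETUNE = TrainConfig(
    optimizer="adamw",
    lr=5.0e-5,            # exponent sign assumed; the source text is garbled
    weight_decay=0.01,
    batch_size=64,
)


def warmup_lr(cfg: TrainConfig, epoch: int) -> float:
    """Learning rate for a 0-indexed epoch under an ASSUMED linear warm-up.

    The excerpt does not state the warm-up shape or what happens after it;
    this sketch simply holds the base rate once the warm-up ends.
    """
    if cfg.warmup_epochs > 0 and epoch < cfg.warmup_epochs:
        return cfg.lr * (epoch + 1) / cfg.warmup_epochs
    return cfg.lr
```

For example, `warmup_lr(PRETRAIN, 0)` gives one twentieth of the base rate, and any epoch from 19 onward gives the full base rate. Any decay schedule used after warm-up would have to come from the paper or its released code.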