Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Hierarchical Self-Attention: Generalizing Neural Attention Mechanics to Multi-Scale Problems
Authors: Saeed Amizadeh, Sara Abdali, Yinheng Li, Kazuhito Koishida
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we present an empirical study aiming at two main goals: (1) showing the capability of the HSA mechanism in incorporating useful domain hierarchy knowledge into training better transformer models from scratch, and (2) demonstrating the unique capacity of HSA as post-training approximation of the Softmax attention in pre-trained transformer models in order to reduce the self-attention computation FLOPS in a zero-shot manner. |
| Researcher Affiliation | Industry | Saeed Amizadeh, Sara Abdali, Yinheng Li, and Kazuhito Koishida Microsoft Redmond, WA 98052 EMAIL |
| Pseudocode | Yes | Algorithms 1-3 illustrate these steps. For the formal correctness results as well as further practical details for our proposed algorithm, see Appendix E. |
| Open Source Code | No | While the datasets used in our empirical studies are publicly available, our code is not being released at the moment until after submission. |
| Open Datasets | Yes | For our empirical assessment, we have chosen the text classification problem for the sentiment analysis task on two datasets: IMDB [57, 1], and Elec [59, 2]... In order to showcase the capabilities of our proposed framework in multi-modal settings, we have performed experiments for the news classification task on N24News dataset [93] |
| Dataset Splits | Yes | For the validation set, we have used 10% of the training set... For training/validation/testing splitting, we use random splitting of ratio 8:1:1 used by the original paper. |
| Hardware Specification | Yes | In this appendix, we detail the experimental settings used for the reported experiments in the main paper, which are all completed on 2 Nvidia Titan V GPUs with 24GB GPU memory on a local Lambda box. |
| Software Dependencies | No | In particular, w_{i,j} is the weight of the j-th term for computing the summed quantity at the i-th node (typically 1 or 0). As for the quantities in Algorithms 3 and 2, µ_k( ), µ_q( ) and η( ) can be parallelized over all the nodes in h_x; that is, in order to compute each one of these quantities for all nodes of h_x, only one sparse matrix-vector multiplication is needed given the appropriate coefficient matrix. The computation of ฯ( ) and ฯ( ) is also parallelizable over the nodes belonging to the same depth in h_x; in other words, given the appropriate coefficient matrices, we would need D sparse matrix-vector multiplications to calculate each one of these quantities for all nodes in h_x, where D is the depth of h_x. Since the coefficient matrices in this scheme are highly sparse, we have represented the coefficient matrices using sparse tensors and used the efficient implementation of sparse matrix by dense vector multiplication in PyTorch to carry out the tree-based summations in Algorithms 3 and 2. |
| Experiment Setup | Yes | Table 6: Configuration of model architectures employed in all experiments/models. Table 7: The training hyperparameters used for each experiment. |