Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Hierarchical Self-Attention: Generalizing Neural Attention Mechanics to Multi-Scale Problems

Authors: Saeed Amizadeh, Sara Abdali, Yinheng Li, Kazuhito Koishida

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this section, we present an empirical study aiming at two main goals: (1) showing the capability of the HSA mechanism in incorporating useful domain hierarchy knowledge into training better transformer models from scratch, and (2) demonstrating the unique capacity of HSA as post-training approximation of the Softmax attention in pre-trained transformer models in order to reduce the self-attention computation FLOPS in a zero-shot manner.
Researcher Affiliation	Industry	Saeed Amizadeh, Sara Abdali, Yinheng Li, and Kazuhito Koishida Microsoft Redmond, WA 98052 EMAIL
Pseudocode	Yes	Algorithms 1–3 illustrate these steps. For the formal correctness results as well as further practical details for our proposed algorithm, see Appendix E.
Open Source Code	No	While the datasets used in our empirical studies are publicly available, our code is not being released at the moment until after submission.
Open Datasets	Yes	For our empirical assessment, we have chosen the text classification problem for the sentiment analysis task on two datasets: IMDB [57, 1], and Elec [59, 2]... In order to showcase the capabilities of our proposed framework in multi-modal settings, we have performed experiments for the news classification task on N24News dataset [93]
Dataset Splits	Yes	For the validation set, we have used 10% of the training set... For training/validation/testing splitting, we use random splitting of ratio 8:1:1 used by the original paper.
Hardware Specification	Yes	In this appendix, we detail the experimental settings used for the reported experiments in the main paper, which are all completed on 2 Nvidia Titan V GPUs with 24GB GPU memory on a local Lambda box.
Software Dependencies	No	In particular, w_{i,j} is the weight of the j-th term for computing the summed quantity at the i-th node (typically 1 or 0). As for the quantities in Algorithms 3 and 2, µ_k(·), µ_q(·) and η(·) can be parallelized over all the nodes in hx; that is, in order to compute each one of these quantities for all nodes of hx, only one sparse matrix-vector multiplication is needed given the appropriate coefficient matrix. The computation of ϕ(·) and ϑ(·) is also parallelizable over the nodes belonging to the same depth in hx; in other words, given the appropriate coefficient matrices, we would need D sparse matrix-vector multiplications to calculate each one of these quantities for all nodes in hx, where D is the depth of hx. Since the coefficient matrices in this scheme are highly sparse, we have represented the coefficient matrices using sparse tensors and used the efficient implementation of sparse matrix by dense vector multiplication in PyTorch to carry out the tree-based summations in Algorithms 3 and 2.
Experiment Setup	Yes	Table 6: Configuration of model architectures employed in all experiments/models. Table 7: The training hyperparameters used for each experiment.