Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..

Tropical Attention: Neural Algorithmic Reasoning for Combinatorial Algorithms

Authors: Baran Hashemi, Kurt Pasque, Chris Teska, Ruriko Yoshida

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirically, we show that Tropical Attention delivers stronger out-of-distribution generalization in both length and value, with high robustness against perturbative noise, and substantially faster inference with fewer parameters compared to Softmax-based and recurrent attention baselines, respectively. For the first time, we push the domain of neural algorithmic reasoning beyond PTIME problems to NP-hard/complete problems, paving the way toward sharper and more expressive Large Reasoning Models (LRMs) capable of tackling complex combinatorial challenges in Phylogenetics, Cryptography, Particle Physics, and Mathematical Discovery. The code is available at https://github.com/Baran-phys/Tropical-Attention/.
Researcher Affiliation	Academia	Baran Hashemi Origins Data Science Lab Technical University of Munich Munich, Germany EMAIL Kurt Pasque Naval Postgraduate School Monterey, California EMAIL Chris Teska Naval Postgraduate School Monterey, California EMAIL Ruriko Yoshida Naval Postgraduate School Monterey, California EMAIL
Pseudocode	Yes	Algorithm 1 Comparison between vanilla attention and Tropical Attention
Open Source Code	Yes	The code is available at https://github.com/Baran-phys/Tropical-Attention/.
Open Datasets	No	For our experiments we custom generated both train and test datasets following procedures from the canonical algorithmic reasoning benchmark CLRS [43]. This decision was due largely to the absence of NP-hard and NP-complete problems in the CLRS benchmark, but also because our framework is designed for sequence and set-based data modalities and our OOD evaluation includes two extra new evaluation pipelines, adversarial perturbations and value generalization. All datasets, generation scripts, and OOD protocols are described in Appendix E and F. Code used to generate data can be found in our public repository: https://github.com/Baran-phys/Tropical-Attention/blob/main/dataloaders.py
Dataset Splits	Yes	For Length OOD, all models were trained on sequence length of 8 and we evaluated them at sequence length of 64, with the exception of the graph problems (Floyd Warshall and SCC), which were evaluated on sequence length of 16. For Perturbative Noise OOD, each input was perturbed with probability 0.5 with a random integer sampled from the task s perturbative noise range. Table 5: Training hyperparameters and data ranges for each combinatorial task. Each task was trained with 10M samples, learning rate of 0.0001, input sequence length of 8, and no perturbations.
Hardware Specification	Yes	We use one NVIDIA Tesla V100 GPU to train each model.
Software Dependencies	No	The primary packages utilized in constructing our experiment is Pytorch [74], Pandas and Scipy [75], Sci Kit Learn [76], and Numpy [77]. The basic workflow is described below:
Experiment Setup	Yes	To ensure a fair comparison, all variants share identical backbone hyperparameters: depth, width, and number of heads. The only architectural difference is the attention kernel. Crucially, no model sees OOD examples during optimization. We follow a uniform procedure in which each model is trained from scratch under the same training regime with task specific fixed input sequence lengths and value ranges. Table 5: Training hyperparameters and data ranges for each combinatorial task. Each task was trained with 10M samples, learning rate of 0.0001, input sequence length of 8, and no perturbations.