Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Spectral Conditioning of Attention Improves Transformer Performance
Authors: Hemanth Saratchandran, Simon Lucey
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically show that this improved Jacobian conditioning translates to enhanced performance in practice. Our approach is simple, broadly applicable, and can be easily integrated as a drop-in replacement for a wide range of existing attention mechanisms. We validate its effectiveness across diverse transformer architectures and tasks, demonstrating consistent improvements in performance. |
| Researcher Affiliation | Academia | Hemanth Saratchandran Australian Institute for Machine Learning Adelaide University EMAIL Simon Lucey Australian Institute for Machine Learning Adelaide University EMAIL |
| Pseudocode | No | The paper describes the steps for applying the spectral conditioned attention in text using numbered lists and mathematical notation, but does not include a block explicitly labeled as 'Pseudocode' or 'Algorithm'. |
| Open Source Code | No | We use publicly available code from works within the literature that we clearly cite. Their code is all publicly available on their GitHub. This statement refers to the code of other works used by the authors, not the release of their own code implementing spectral conditioning. |
| Open Datasets | Yes | for image classification on ImageNet-1k. ... fine-tuning on the COCO 2017 dataset [18]. ... Experiments were conducted on the Long-Range Arena (LRA) benchmark [32] ... trained on The Pile dataset [9], a large-scale corpus designed for language model training... evaluate the performance on the GLUE benchmark [35] |
| Dataset Splits | Yes | We validate the theoretical results from Section 3 on a ViT-B model trained on the ImageNet-1k dataset. ... pretraining an XCiT architecture [4] on the ImageNet-1k dataset, followed by fine-tuning on the COCO 2017 dataset [18]. ... Experiments were conducted on the Long-Range Arena (LRA) benchmark [32] ... trained on The Pile dataset [9] ... evaluate the performance on the GLUE benchmark [35] |
| Hardware Specification | Yes | The image classification experiments in Section 4.1 of the paper were done on Nvidia A100 GPUs. ... The experiments for Section 4.2 of the paper on object detection and instance segmentation were carried out on Nvidia A100 GPUs. ... All the experiments for the Nyströmformer on LRA benchmark results in Section 4.3 were carried out on Nvidia A100 GPUs ... The language modeling experiment in Section 4.4 were all carried out on a Nvidia A6000 GPU. |
| Software Dependencies | No | The implementation of the ViTs were all done using the Timm code base [36]. The architectures were all trained from scratch on the ImageNet-1k dataset using the AdamW optimizer following the hyperparameters used in the original papers [7, 20, 4, 33, 6]. |
| Experiment Setup | Yes | Implementation. In all cases, we used the implementation described in Section 3.3, where fixed correction terms C_Q, C_K, and C_V are added to the query, key and value matrices W_Q, W_K and W_V respectively within each attention layer, with λ = 10 (see Section A.2.1 for an ablation on λ). These terms are not updated during training and therefore do not introduce any additional trainable parameters. ... All models were trained using the AdamW optimizer, see Section A.2.1. We trained each model five times with five different random seeds. |
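
The Experiment Setup row describes fixed correction terms that are added to the query, key and value matrices and are never trained. The following is a minimal sketch of that idea in plain Python, not the authors' implementation. The summary does not say what the correction terms look like, so the scaled rectangular identity used here is an assumption, as are the names `ConditionedProjection`, `rectangular_identity` and `sgd_step`. A plain gradient step stands in for AdamW to keep the example dependency-free.

```python
import random

LAMBDA = 10.0  # value quoted in the summary; the form of C below is an assumption


def rectangular_identity(rows, cols, scale):
    """Hypothetical correction term: `scale` on the main diagonal, zeros elsewhere."""
    return [[scale if i == j else 0.0 for j in range(cols)] for i in range(rows)]


class ConditionedProjection:
    """One projection (query, key, or value) with a fixed additive correction."""

    def __init__(self, rows, cols, lam=LAMBDA, seed=0):
        rng = random.Random(seed)
        # Trainable weight W with a small random initialisation.
        self.weight = [[rng.gauss(0.0, 0.02) for _ in range(cols)]
                       for _ in range(rows)]
        # Fixed correction C: the update step below never touches it.
        self.correction = rectangular_identity(rows, cols, lam)

    def effective_weight(self):
        """Return W + C, the matrix that would be used inside the attention layer."""
        return [[w + c for w, c in zip(w_row, c_row)]
                for w_row, c_row in zip(self.weight, self.correction)]

    def num_trainable(self):
        """Only W counts as trainable; C adds no parameters."""
        return len(self.weight) * len(self.weight[0])

    def sgd_step(self, grad, lr=0.1):
        """Plain gradient step on W only (a simplification of AdamW)."""
        self.weight = [[w - lr * g for w, g in zip(w_row, g_row)]
                       for w_row, g_row in zip(self.weight, grad)]


proj = ConditionedProjection(rows=4, cols=3)
correction_before = [row[:] for row in proj.correction]
proj.sgd_step([[1.0] * 3 for _ in range(4)])
assert proj.correction == correction_before  # C is unchanged by the update
assert proj.num_trainable() == 12            # same count as an uncorrected 4x3 W
```

In a real attention layer, one such projection would be created for each of the query, key and value matrices, with the correction stored as a constant rather than as a parameter.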