Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

SAS: Simulated Attention Score

Authors: Chuanyang Zheng, Jiankai Sun, Yihang Gao, Yuehao Wang, Peihao Wang, Jing Xiong, Liliang Ren, Hao Cheng, Janardhan Kulkarni, Yelong Shen, Zhangyang "Atlas" Wang, Mac Schwager, Anderson Schneider, Xiaodong Liu, Jianfeng Gao

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive experiments on a variety of datasets and tasks demonstrate the effectiveness of the proposed SAS method, achieving significant improvements over different attention variants. (...) 4 Experiment (...) Baselines. We evaluate the proposed SAS against a range of established baselines, including MHA [77], MQA [61], GQA [3], MLA [40] and TPA [100]. Datasets. Our analysis involves training language models on the Arxiv and Books3 datasets, which are frequently used benchmarks for evaluating model performance [54, 12, 37, 20]. Also, we train the model on the large-scale dataset FineWeb-Edu [45]. Experiment settings. Initially, we compare SAS with other baselines at training lengths 512 and 1024, with model size 125M decoder-only Transformers [7], whose configuration is shown in Appendix C.
Researcher Affiliation | Collaboration | 1 Morgan Stanley, 2 Stanford, 3 Microsoft Research, 4 NUS, 5 UT Austin, 6 HKU
Pseudocode | Yes | L Implementation. In this section, we present the implementation of the proposed SAS module in PyTorch for research purposes, which is consistent with the intended use [49]. import torch; import torch.nn as nn; import torch.nn.functional as F; class ResidualCNN(nn.Module): (... code ...) class ResidualMLP(nn.Module): (... code ...) class SAS(nn.Module):
Open Source Code | Yes | L Implementation. In this section, we present the implementation of the proposed SAS module in PyTorch for research purposes, which is consistent with the intended use [49]. class SAS(nn.Module): (... PyTorch code ...) 5. Open access to data and code. Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We have provided the dataset in Section Experiment. And the code is shown in the Appendix L.
Open Datasets | Yes | Datasets. Our analysis involves training language models on the Arxiv and Books3 datasets, which are frequently used benchmarks for evaluating model performance [54, 12, 37, 20]. Also, we train the model on the large-scale dataset FineWeb-Edu [45].
Dataset Splits | No | The paper mentions using 'Arxiv and Books3 datasets' and 'FineWeb-Edu' but does not specify exact training, validation, and test splits (e.g., percentages or sample counts). It shows 'Validation perplexity' in figures, indicating a validation set was used, but the methodology for splitting the data is not detailed.
Hardware Specification | No | Appendix F, 'The Training Cost', provides a table with time costs for different model sizes (125M, 350M, 2.7B, 6.7B, 10.6B) but does not specify the type of hardware (e.g., GPU model, CPU type) used for these experiments.
Software Dependencies | No | Appendix L shows imports for 'torch', 'torch.nn', and 'torch.nn.functional'. Appendix C mentions 'AdamW [44] optimizer' and 'cosine annealing scheduler [43]'. However, specific version numbers for PyTorch or other libraries are not provided.
Experiment Setup | Yes | Experiment settings. Initially, we compare SAS with other baselines at training lengths 512 and 1024, with model size 125M decoder-only Transformers [7], whose configuration is shown in Appendix C. (...) For the experiment on the FineWeb-Edu dataset, the experiment setup is as follows: We follow the nanoGPT training configuration. In particular, we use the AdamW [44] optimizer with (β1, β2) = (0.9, 0.95), a weight decay of 0.1, and gradient clipping at 1.0. We follow the same setting as nanoGPT that the learning rate is managed by a cosine annealing scheduler [43]. For the model setting, we mostly follow the setting of TPA [100].
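The optimizer settings quoted in the Experiment Setup row can be illustrated with a small standard-library sketch. It is not the authors' training code: only the betas, weight decay, and clipping threshold come from the quoted text. The peak learning rate, minimum learning rate, and step count are hypothetical placeholders, since the excerpt does not state them, and any warmup phase is omitted. The gradient clipping shown is the common global-norm variant, which is an assumption about what "gradient clipping at 1.0" means here.

```python
import math

# Values stated in the quoted experiment setup.
ADAMW_BETAS = (0.9, 0.95)
WEIGHT_DECAY = 0.1
GRAD_CLIP_NORM = 1.0

# Hypothetical placeholders: the excerpt does not give these values.
PEAK_LR = 6e-4
MIN_LR = 6e-5
TOTAL_STEPS = 1000


def cosine_annealing_lr(step, total_steps=TOTAL_STEPS,
                        peak_lr=PEAK_LR, min_lr=MIN_LR):
    """Cosine decay from peak_lr at step 0 down to min_lr at total_steps."""
    # Clamp progress to [0, 1] so steps past the end stay at min_lr.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def clip_by_global_norm(grads, max_norm=GRAD_CLIP_NORM):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)
    scale = max_norm / total_norm
    return [g * scale for g in grads]
```

In a PyTorch training loop the same roles would typically be filled by `torch.optim.AdamW`, a cosine learning-rate scheduler, and `torch.nn.utils.clip_grad_norm_`. The report notes that the paper gives no library versions, so exact behavior may differ from the authors' runs.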