Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Tensor Product Attention Is All You Need

Authors: Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Zhen Qin, Yang Yuan, Quanquan Gu, Andrew Yao

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive empirical evaluation on language modeling tasks, we demonstrate that T6 surpasses or matches the performance of standard Transformer baselines, including Multi-Head Attention (MHA), Multi-Query Attention (MQA), Grouped-Query Attention (GQA), and Multi-Head Latent Attention (MLA) across various metrics, including perplexity and a range of established evaluation benchmarks. ... 6 Experiments ... 6.1 Language Modeling Tasks ... 6.2 Experimental Results on FlashTPA Decoding
Researcher Affiliation | Academia | (1) IIIS, Tsinghua University; (2) Shanghai Qi Zhi Institute; (3) University of California, Los Angeles; (4) Princeton University
Pseudocode | Yes | The detailed definitions of the input factorized components and the step-by-step pseudo-code for FlashTPA Decoding are provided in Algorithm 2. An optimized Triton kernel implementation is outlined in Algorithm 3 (see Appendix B.1).
Open Source Code | Yes | Project Page: https://github.com/tensorgi/TPA. ... The code and data required to reproduce the main experimental results are provided at https://anonymous.4open.science/r/T6-anonymous-2025.
Open Datasets | Yes | All experiments reported in this paper are implemented based on the nanoGPT codebase [24], and we pretrain our models using the FineWeb-Edu 100B dataset [37]. ... We evaluate zero-shot and two-shot performance on standard benchmarks, including ARC [63], BoolQ [13], HellaSwag [64], OBQA [39], PIQA [4], WinoGrande [43], and MMLU [18], using the lm-evaluation-harness codebase [14].
Dataset Splits | Yes | The dataset contains 100 billion tokens for training and 0.1 billion tokens for validation.
Hardware Specification | Yes | Details on architecture hyperparameters and training hardware are shown in Appendix H.1. ... Table 9: The architecture hyper-parameters and training devices of models. Abbreviations: BS. = Batch Size, GAS. = Gradient Accumulation Steps. ... Small, 124M, 4 A100 GPUs ... Medium, 353M, 8 A100 GPUs ... Large, 772M, 8 A100 GPUs ... XL, 1.55B, 8 A100 GPUs
Software Dependencies | No | The paper mentions software like "nanoGPT codebase [24]" and "Triton [57]" but does not specify version numbers for these or other relevant libraries/frameworks.
Experiment Setup | Yes | We follow the nanoGPT training configuration [24]. In particular, we use the AdamW [35] optimizer with (β1, β2) = (0.9, 0.95), a weight decay of 0.1, and gradient clipping at 1.0. We follow the same setting as nanoGPT that the learning rate is managed by a cosine annealing scheduler [36] with 2,000 warmup steps and a (total) global batch size of 480. For the small, medium, large and XL models, we set maximum learning rates of 6 × 10^-4, 3 × 10^-4, 2 × 10^-4, and 1 × 10^-4 (respectively), and minimum learning rates of 3 × 10^-5, 6 × 10^-5, 1 × 10^-5, and 1 × 10^-5 (respectively). ... Table 9: The architecture hyper-parameters and training devices of models. Abbreviations: BS. = Batch Size, GAS. = Gradient Accumulation Steps. ... Micro BS. ... GAS.
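
The learning-rate schedule quoted in the Experiment Setup row (cosine annealing with 2,000 warmup steps, between a maximum and a minimum learning rate) can be illustrated with a minimal sketch. This is not the authors' code: the linear shape of the warmup, the function name `cosine_lr`, and the value of `TOTAL_STEPS` are assumptions made for illustration, since the quoted text does not state the total number of training steps. Only the small-model rates and the warmup length come from the quoted setup.

```python
import math


def cosine_lr(step, max_lr, min_lr, warmup_steps, total_steps):
    """Return the learning rate at `step` (0-indexed).

    Assumed shape: linear warmup to max_lr, then cosine decay to min_lr.
    """
    if step < warmup_steps:
        # Linear warmup (an assumption; the quoted text gives only the length).
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        # After the schedule ends, hold the floor value.
        return min_lr
    # Fraction of the decay phase completed, in [0, 1).
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    # Cosine coefficient goes from 1 (start of decay) down to 0 (end).
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + coeff * (max_lr - min_lr)


# Small-model values quoted in the table above.
SMALL = dict(max_lr=6e-4, min_lr=3e-5, warmup_steps=2000)

# Hypothetical total step count, chosen only to make the sketch runnable.
TOTAL_STEPS = 100_000

if __name__ == "__main__":
    for s in (0, 1999, 2000, 51_000, 100_000):
        print(s, cosine_lr(s, total_steps=TOTAL_STEPS, **SMALL))
```

With these assumed values the rate peaks at the end of warmup, passes through the midpoint of the two rates halfway through the decay phase, and settles at the minimum once `TOTAL_STEPS` is reached.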