Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
PaTH Attention: Position Encoding via Accumulating Householder Transformations
Authors: Songlin Yang, Yikang Shen, Kaiyue Wen, Shawn Tan, Mayank Mishra, Liliang Ren, Rameswar Panda, Yoon Kim
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Across both targeted synthetic benchmarks and moderate-scale real-world language modeling experiments, we find that PaTH improves upon RoPE and other recent baselines. Finally, we show that we can convert pretrained RoPE transformers into PaTH with continued pretraining. Empirical results show that PaTH-based models can solve challenging synthetic state-tracking tasks where RoPE-based Transformers struggle. On moderate-scale language modeling with 760M-parameter Transformers, PaTH outperforms both RoPE and the Forgetting Transformer [39], which modulates attention logits via a data-dependent additive term. Section 4 is dedicated to "Experiments". |
| Researcher Affiliation | Collaboration | ¹Massachusetts Institute of Technology, ²MIT-IBM Watson AI Lab, ³Stanford University, ⁴Microsoft |
| Pseudocode | No | The paper describes the steps for efficient training in a bulleted list format in Section 3.3, titled "Efficient Training", but it does not include a clearly labeled "Pseudocode" or "Algorithm" block. |
| Open Source Code | Yes | The implementation of the PaTH attention layer is also made available as part of the FlashLinearAttention library [80, 79]: https://github.com/fla-org/flash-linear-attention. We have open-sourced the Triton kernel at https://github.com/fla-org/flash-linear-attention, and our experiments can be reproduced using our maintained training framework https://github.com/fla-org/flame. |
| Open Datasets | Yes | We pretrain language models with 760M parameters on the Fineweb-Edu corpus [54] for 50B tokens using the Mistral tokenizer and a sequence length of 4096. We then evaluate the pretrained models on the following benchmarks...LAMBADA [LMB.; 53] (OpenAI version), PiQA [6], HellaSwag [Hella.; 83], WinoGrande [Wino.; 64], ARC-easy (ARC-e) and ARC-challenge (Arc-c) [10]. Figure 3 presents results on three long-context corpora from different domains: PG-19 [62] (books), CodeParrot (code), and NarrativeQA [31] (conversational English). Table 3 summarizes results on four challenging long-context benchmarks: RULER [23], BABILONG [33], PhoneBook [26], and LongBench-E [3]....from the DCLM corpus [37]...Python-Edu (code), and MegaMath Web (math) corpora [87]. |
| Dataset Splits | No | The paper uses several benchmark datasets and mentions training on the Fineweb-Edu corpus, but it does not explicitly provide specific percentages, sample counts, or detailed methodologies for training, validation, and test splits for any of its experiments, including synthetic tasks or language modeling evaluations. |
| Hardware Specification | Yes | We implement the PaTH attention kernel in Triton [75] and benchmark its runtime on a single H100 GPU against FoX and standard RoPE attention under identical settings... Each 760M model is trained on 8 H100 GPUs for 2-3 days. For synthetic tasks, we use A100 GPUs, completing training within several hours. |
| Software Dependencies | No | The paper mentions using "Triton [75]" for kernel implementation and the "Mistral tokenizer", but it does not provide specific version numbers for these or any other key software components or libraries used in the experiments. |
| Experiment Setup | Yes | All models are trained with AdamW [46], using a cosine learning rate schedule with a 1B-token warmup. The peak learning rate is 1e-3, with both initial and final rates set to 3e-5. We apply a weight decay of 0.01 and gradient clipping of 1.0. The batch size is 2M tokens. Parameters are initialized with a standard deviation of 0.02. Each 760M model is trained on 8 H100 GPUs for 2-3 days. For synthetic tasks, we use A100 GPUs, completing training within several hours. |
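
The learning-rate settings quoted in the Experiment Setup row can be turned into a concrete schedule. The sketch below is one plausible reading, not the authors' training code: it assumes a linear warmup (the quoted text only gives the warmup length), reads "2M" and "1B" as decimal values, and takes the 50B-token budget from the Open Datasets row, which applies to the 760M language models only. The names `lr_at` and `lr_at_step` are hypothetical.

```python
import math

# Values quoted in the table above.
PEAK_LR = 1e-3
INIT_LR = 3e-5
FINAL_LR = 3e-5
WARMUP_TOKENS = 1_000_000_000   # "1B-token warmup", read as 1e9 (assumption)
TOTAL_TOKENS = 50_000_000_000   # "50B tokens", 760M language models only
BATCH_TOKENS = 2_000_000        # "2M tokens", read as 2e6 (assumption)


def lr_at(tokens_seen: int) -> float:
    """Learning rate after `tokens_seen` training tokens."""
    if tokens_seen < WARMUP_TOKENS:
        # Assumption: linear ramp from the initial rate to the peak rate.
        frac = tokens_seen / WARMUP_TOKENS
        return INIT_LR + frac * (PEAK_LR - INIT_LR)
    # Cosine decay from the peak rate down to the final rate.
    progress = (tokens_seen - WARMUP_TOKENS) / (TOTAL_TOKENS - WARMUP_TOKENS)
    progress = min(1.0, progress)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return FINAL_LR + cosine * (PEAK_LR - FINAL_LR)


def lr_at_step(step: int) -> float:
    """Learning rate at optimizer step `step`, one batch per step."""
    return lr_at(step * BATCH_TOKENS)


warmup_steps = WARMUP_TOKENS // BATCH_TOKENS
total_steps = TOTAL_TOKENS // BATCH_TOKENS
```

Under these assumptions the warmup covers `warmup_steps` optimizer steps out of `total_steps`, and `lr_at_step(0)` and `lr_at_step(total_steps)` both return the 3e-5 floor. The weight decay (0.01) and gradient clipping (1.0) from the same row would be set on the optimizer and training loop, which are outside this sketch.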