Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
DeltaFormer: Unlock the state space of Transformer
Authors: Mingyu Xu, Tenglong Ao, Jiaao He, Jianqiao Lu, Guang Shi, Mingwu Zheng
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We theoretically and empirically demonstrate that this new architecture can overcome the inherent TC0 expressivity limitations of standard Transformers, while remaining at least as effective in language modeling tasks. Section 4 is titled "Experiment" and details empirical validations on tasks like element tracking, graph reachability, and language modeling, including performance metrics and comparisons (e.g., "Across almost all reasonably simple choices of κ1(·), DeltaFormer achieved better results than the Transformer."). |
| Researcher Affiliation | Industry | Mingyu Xu, Seed Team, ByteDance, EMAIL; Tenglong Ao*, Seed Team, ByteDance, EMAIL; Jiaao He, Seed Team, ByteDance, EMAIL; Jianqiao Lu, Seed Team, ByteDance, EMAIL; Guang Shi, Seed Team, ByteDance, EMAIL; Shu Zhong*, Seed Team, ByteDance, EMAIL |
| Pseudocode | Yes | For pseudo-code of the chunk-wise implementation, refer to Appendix E. Listing 1: PyTorch-style pseudo-code. |
| Open Source Code | Yes | Our code is available at https://github.com/fla-org/flash-linear-attention/blob/main/fla/layers/deltaformer.py, and in the supplementary materials. |
| Open Datasets | Yes | Following prior work [76], we use open-source code of them and open-source dataset Fineweb-edu for training and the open-source evaluation tool lm-evaluation-harness for benchmark evaluation. The benchmarks that include LAMBADA [LMB.;[50]], PIQA [8], HellaSwag [Hella.;[81]], WinoGrande [Wino.;[60]], ARC-easy (ARC-e) and ARC-challenge (Arc-c) [12], BoolQ [11], OpenBookQA [OBQA.;[45]], SIQA [62] and COPA [57]. |
| Dataset Splits | No | The paper mentions a "default context length of 16" for the element tracking task and a curriculum learning strategy (starting with length 32 and gradually increasing to 256). For language modeling, it states "The context length is 2,048 and the global batch size is 0.5M tokens." While context lengths are provided, specific percentages or sample counts for training, validation, and test splits for the Fineweb-edu dataset are not explicitly detailed. |
| Hardware Specification | Yes | Table 6: Comparison of execution times with tensor shape [2,32,8192,128] in an H100. Method and time: Recurrent, 279.9 ms; Parallel, 102.2 ms; Chunk-wise, 12.7 ms. |
| Software Dependencies | No | Listing 1 provides PyTorch-style pseudocode using `import torch` and `import torch.nn.functional as F`. However, specific version numbers for Python, PyTorch, or other libraries are not mentioned in the text. |
| Experiment Setup | Yes | For the language modeling task: "We train on a 340M parameter scale with 15B tokens with a peak learning rate of 2e-3. The context length is 2,048 and the global batch size is 0.5M tokens." For element tracking: "a default context length of 16" and a curriculum learning strategy: "starting with length 32 and gradually increasing the window size". |
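
The figures quoted in the Experiment Setup and Hardware Specification rows imply a few derived quantities that can help when sanity-checking a reproduction. The following is a minimal sketch, not taken from the paper: the function names are hypothetical, and reading "0.5M tokens" as exactly 500,000 is an assumption (a power of two such as 524,288 is equally plausible and would change the results slightly).

```python
# Figures quoted in the table above. Names and helpers here are illustrative only.
TOTAL_TOKENS = 15_000_000_000   # "15B tokens"
CONTEXT_LENGTH = 2_048          # "The context length is 2,048"
BATCH_TOKENS = 500_000          # "0.5M tokens", read as 5e5 (assumption)

# Execution times reported in Table 6 (milliseconds).
TABLE6_MS = {"Recurrent": 279.9, "Parallel": 102.2, "Chunk-wise": 12.7}


def training_budget(total_tokens, batch_tokens, context_length):
    """Return (optimizer steps, sequences per batch) implied by a token budget."""
    steps = total_tokens // batch_tokens
    sequences_per_batch = batch_tokens / context_length
    return steps, sequences_per_batch


def relative_times(times_ms, baseline):
    """Return each method's time divided by the baseline method's time."""
    base = times_ms[baseline]
    return {name: t / base for name, t in times_ms.items()}


if __name__ == "__main__":
    steps, seqs = training_budget(TOTAL_TOKENS, BATCH_TOKENS, CONTEXT_LENGTH)
    print(f"optimizer steps: {steps}, sequences per batch: {seqs:.1f}")
    for name, ratio in relative_times(TABLE6_MS, "Chunk-wise").items():
        print(f"{name}: {ratio:.1f}x the chunk-wise time")
```

Under these assumptions the budget works out to 30,000 optimizer steps of roughly 244 sequences each, and the chunk-wise implementation is about 22 times faster than the recurrent one and about 8 times faster than the parallel one for the reported tensor shape.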