Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].
When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training
Authors: Haonan Wang, Qian Liu, Chao Du, Tongyao Zhu, Cunxiao Du, Kenji Kawaguchi, Tianyu Pang
TMLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on three types of LLMs demonstrate that Anchor Attention significantly improves long-context performance and reduces training time by over 50% compared to standard full attention mechanisms, while preserving the original LLM's capabilities on general tasks. |
| Researcher Affiliation | Collaboration | Haonan Wang EMAIL National University of Singapore; Qian Liu EMAIL Sea AI Lab, Singapore |
| Pseudocode | No | The paper describes methods through textual descriptions and illustrative figures (e.g., Figure 2 illustrating attention paradigms), but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | AnchorContext: The implementation of Anchor Attention supports several popular models, using FlashAttention2 and FlexAttention, and is available at https://github.com/haonan3/AnchorContext. |
| Open Datasets | Yes | We use the SlimPajama dataset (Soboleva et al., 2023) for long-context training, an open-source replication of the LLaMA pretraining data mixture (Touvron et al., 2023). |
| Dataset Splits | No | The paper describes how it samples tokens for training and uses established benchmarks (RULER, LongBench, MMLU, HellaSwag), but does not specify explicit train/validation/test splits with percentages or sample counts for its own generated datasets. |
| Hardware Specification | Yes | All models are trained on 8 NVIDIA A100 GPUs. |
| Software Dependencies | Yes | The flexibility of our AnchorContext approach allows for effortless adoption, enabling researchers to incorporate it without substantial modifications to their codebase... it provides two computational engine options: FlexAttention (which will be natively supported in PyTorch 2.5.0) and FlashAttention. |
| Experiment Setup | Yes | Our training hyperparameters are primarily based on (Zhang, 2023). All models are trained on 8 NVIDIA A100 GPUs. We set the learning rate to 2e-5 and use the AdamW optimizer with weight decay of 0.1, β1 = 0.9, and β2 = 0.95. Each model is trained for 2000 steps, which corresponds to approximately 1 epoch over the 2 billion token dataset. The batch size is set to 8, equating to 0.5 million tokens per batch for 64K context and 1 million tokens for 128K context lengths. |
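
The token counts quoted in the Experiment Setup row can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the authors' code: the dictionary keys and function names are hypothetical, and it assumes that "64K" and "128K" mean 64 × 1024 and 128 × 1024 tokens and that the batch size counts full-length sequences.

```python
# Hyperparameters as quoted in the Experiment Setup row.
# The key names are hypothetical, chosen for illustration only.
TRAIN_CONFIG = {
    "learning_rate": 2e-5,
    "weight_decay": 0.1,
    "adam_beta1": 0.9,
    "adam_beta2": 0.95,
    "train_steps": 2000,
    "batch_size": 8,  # assumed to mean sequences per batch
}

# Assumption: "K" is read as 1024 tokens.
CONTEXT_LENGTHS = {"64K": 64 * 1024, "128K": 128 * 1024}


def tokens_per_batch(batch_size: int, context_length: int) -> int:
    """Tokens in one batch when every sequence fills the context window."""
    return batch_size * context_length


def total_training_tokens(steps: int, batch_size: int, context_length: int) -> int:
    """Tokens processed over the whole run."""
    return steps * tokens_per_batch(batch_size, context_length)


def summarize(config: dict) -> dict:
    """Return per-batch and total token counts for each context length."""
    summary = {}
    for name, length in CONTEXT_LENGTHS.items():
        summary[name] = {
            "tokens_per_batch": tokens_per_batch(config["batch_size"], length),
            "total_tokens": total_training_tokens(
                config["train_steps"], config["batch_size"], length
            ),
        }
    return summary


if __name__ == "__main__":
    for name, stats in summarize(TRAIN_CONFIG).items():
        print(
            f"{name}: {stats['tokens_per_batch'] / 1e6:.2f}M tokens per batch, "
            f"{stats['total_tokens'] / 1e9:.2f}B tokens in total"
        )
```

Under these assumptions, the 128K setting gives roughly 1 million tokens per batch and about 2.1 billion tokens over 2000 steps, consistent with the quoted "approximately 1 epoch over the 2 billion token dataset". The 64K setting gives roughly 0.5 million tokens per batch, so the same number of steps covers about half as many tokens.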