Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Rope to Nope and Back Again: A New Hybrid Attention Strategy

Authors: Bowen Yang, Bharat Venkitesh, Dwaraknath Gnaneshwar Talupuru, Hangyu Lin, David Cairuz, Phil Blunsom, Acyr Locatelli

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental In this paper, we begin analyzing attention patterns of different attention mechanisms, RoPE, NoPE, and QK-Norm and its impacts on long context performance trained up to 750 billion tokens. Building on these insights, we propose a novel hybrid attention architecture and extensively pretrain up to 5 trillion tokens, followed by supervised fine-tuning on a diverse set of datasets tailored for long context. We show that this architecture surpasses existing state-of-the-art extrapolation-based RoPE models [47] by a large margin, striking a balance between efficiency and performance.
Researcher Affiliation Industry Bowen Yang¹, Bharat Venkitesh¹, Dwarak Talupuru¹, Hangyu Lin¹, David Cairuz¹, Phil Blunsom¹, Acyr Locatelli¹ EMAIL EMAIL
Pseudocode No The paper describes the model architecture and experimental procedures in detail using prose and figures (e.g., Figure 3: RNope-SWA Model Architecture), but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code No Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: We did not open source the data or code for training.
Open Datasets Yes We evaluate the variants on a set of core evaluation benchmarks, including MMLU [30], HellaSwag [79], CommonsenseQA [63], ARC [15] for core model capabilities and NIAH benchmark [37] for long context capability. NIAH evaluates a model's ability to retrieve information accurately from a specific sentence (the "needle") embedded within a lengthy document (the "haystack").
Dataset Splits Yes For the SFT stage, we adopt an interleaved training strategy: we combine short- and long-context data in a 3:1 ratio, with context lengths of 8192 and 65536 tokens, respectively. We use a batch size of 0.5 million tokens. Supervised Finetuning. ...the finetuning process utilizes interleaved datasets containing 8k and 128k prompt-response pairs.
Hardware Specification No Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [No] Justification: Compute resources required to train varies depending on the type of hardware and frameworks used. It is also not very relevant to the paper's focus.
Software Dependencies No The paper mentions using techniques like FlashAttention [20, 19, 59] and FP8 precision format [51], but it does not specify any particular software libraries with version numbers (e.g., PyTorch 1.x, TensorFlow 2.x, CUDA 11.x).
Experiment Setup Yes We pretrain the model with a batch size of 4 million tokens. We use AdamW with a peak learning rate of 7e-3, a linear warmup of 2000 steps and a cosine learning rate schedule decaying to 3.5e-4 over 179,000 steps for a total of 750 billion tokens. For the SFT stage, we adopt an interleaved training strategy: we combine short- and long-context data in a 3:1 ratio, with context lengths of 8192 and 65536 tokens, respectively. We use a batch size of 0.5 million tokens. Pretraining and Cooldown. The models are pretrained for 5 trillion tokens of diverse data with batch size of 8 million tokens using FP8 precision format [51]. We use a cosine learning rate schedule of 5e-3 peak learning rate and 5% end learning rate with 8,000 linear warmup steps. From the pre-trained model, we linearly anneal the learning rate from 2.5e-4 to 1e-6 for 50,000 steps in BF16 precision. The context length was initially maintained at 8k for the first 35,000 steps, then extended to 32k and 128k for 10,000 steps and 5,000 steps, respectively.
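
The 750-billion-token schedule quoted under Experiment Setup (linear warmup for 2000 steps to a peak of 7e-3, then cosine decay to 3.5e-4 by step 179,000) can be sketched as a small function. This is an illustrative reading of the quoted text, not the authors' code, which was not released. The function name `lr_at_step`, the choice to count the warmup inside the 179,000 steps, and the reading of "4 million tokens" as 4 × 2^20 are all assumptions.

```python
import math

# Values quoted in the report; how they combine is an assumption.
PEAK_LR = 7e-3
END_LR = 3.5e-4
WARMUP_STEPS = 2_000
TOTAL_STEPS = 179_000  # assumed to include the warmup steps


def lr_at_step(step: int) -> float:
    """Hypothetical helper: linear warmup followed by cosine decay."""
    if step < WARMUP_STEPS:
        # Linear ramp that reaches PEAK_LR on the last warmup step.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return END_LR + (PEAK_LR - END_LR) * cosine


# Token budget check, assuming a batch of 4 * 2**20 tokens per step.
TOKENS_PER_STEP = 4 * 2**20
total_tokens = TOTAL_STEPS * TOKENS_PER_STEP

if __name__ == "__main__":
    for s in (0, 1_999, 90_000, 179_000):
        print(s, lr_at_step(s))
    print(f"total tokens: {total_tokens / 1e9:.1f}B")
```

Under the 4 × 2^20 assumption the budget comes to roughly 750.8 billion tokens, which matches the quoted total; a batch of exactly 4,000,000 tokens would give 716 billion instead, so the binary reading seems the more likely one.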