Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Longer Context, Deeper Thinking: Uncovering the Role of Long-Context Ability in Reasoning

Authors: Wang Yang, Zirui Liu, Hongye Jin, Qingyu Yin, Vipin Chaudhary, Xiaotian Han

NeurIPS 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | To rigorously investigate this, we conduct a controlled study comparing language models with identical architectures and fine-tuning data, but varying degrees of long-context pretraining. Our experimental results reveal a consistent and compelling trend: models with stronger long-context capabilities consistently outperform their counterparts on reasoning tasks after SFT. |
| Researcher Affiliation | Academia | Wang Yang1, Zirui Liu2, Hongye Jin3, Qingyu Yin, Vipin Chaudhary1, Xiaotian Han1; 1 Case Western Reserve University, 2 University of Minnesota Twin Cities, 3 Texas A&M University; EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper describes methods and experimental setups but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is anonymously available at https://github.com/uservan/LCTMerge. |
| Open Datasets | Yes | We utilize the Open R1-Math-220K dataset [8] and divide it into two categories based on response length: short samples (responses within 8K tokens) and long samples (responses ranging from 8K to 16K tokens). For both categories, we sample 20K instances and perform correctness filtering to ensure that each response is factually accurate and correct. These two subsets are then used independently to fine-tune models to improve their reasoning ability. ... Reasoning evaluation. To further evaluate the model's reasoning ability post-training, we use three math benchmarks: MATH500, AIME22-24, and GSM8K. |
| Dataset Splits | No | The paper describes the datasets used for fine-tuning and evaluation, such as Open R1-Math-220K, MATH500, AIME22-24, and GSM8K, and mentions sampling 20K instances for fine-tuning, but it does not explicitly detail the training, validation, or test splits for these datasets required for reproduction. |
| Hardware Specification | Yes | All models are fine-tuned using four NVIDIA H200 GPUs. |
| Software Dependencies | No | We employ the LLaMAFactory library with a batch size of 32, a learning rate of 1.0 × 10⁻⁵ and 3 epochs. The paper mentions the LLaMAFactory library but does not specify its version number. |
| Experiment Setup | Yes | All models are fine-tuned using four NVIDIA H200 GPUs. We employ the LLaMAFactory library with a batch size of 32, a learning rate of 1.0 × 10⁻⁵ and 3 epochs. |
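
The Open Datasets entry above describes splitting fine-tuning data by response length (within 8K tokens versus 8K to 16K tokens) and sampling 20K instances per category. The following is a minimal sketch of that bucketing step only, not the authors' implementation. Several details are assumptions made for illustration: the `response` field name, the `count_tokens` callable, reading "8K" and "16K" as 8192 and 16384 tokens, and placing a response of exactly 8192 tokens in the short bucket. The correctness filtering mentioned in the paper is not shown.

```python
import random
from typing import Callable, Dict, List, Sequence

# Assumption: "8K" and "16K" are read as 8192 and 16384 tokens.
SHORT_MAX_TOKENS = 8 * 1024
LONG_MAX_TOKENS = 16 * 1024


def split_by_response_length(
    samples: Sequence[dict],
    count_tokens: Callable[[str], int],
    per_bucket: int = 20_000,
    seed: int = 0,
) -> Dict[str, List[dict]]:
    """Bucket samples by response token count, then subsample each bucket.

    `samples` is a sequence of dicts with a hypothetical "response" field.
    `count_tokens` is supplied by the caller (for example, a tokenizer wrapper).
    """
    buckets: Dict[str, List[dict]] = {"short": [], "long": []}
    for sample in samples:
        n_tokens = count_tokens(sample["response"])
        if n_tokens <= SHORT_MAX_TOKENS:
            buckets["short"].append(sample)
        elif n_tokens <= LONG_MAX_TOKENS:
            buckets["long"].append(sample)
        # Responses longer than 16K tokens fall in neither bucket and are skipped.

    rng = random.Random(seed)  # fixed seed so the subsample is repeatable
    return {
        name: rng.sample(items, min(per_bucket, len(items)))
        for name, items in buckets.items()
    }


if __name__ == "__main__":
    # Toy data with whitespace "tokens"; a real run would use the model tokenizer.
    toy = [{"response": "x " * n} for n in (10, 8192, 8193, 16384, 16385)]
    subsets = split_by_response_length(toy, lambda s: len(s.split()), per_bucket=10)
    print({name: len(items) for name, items in subsets.items()})
```

Passing the token counter as a parameter keeps the sketch independent of any particular tokenizer, which matters here because the paper's exact preprocessing is not specified in this summary.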