Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Who Reasons in the Large Language Models?
Authors: Jie Shao, Jianxin Wu
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using SfN, we provide both circumstantial and empirical evidence suggesting that o_proj plays a central role in enabling reasoning, whereas other modules contribute more to fluent dialogue. These findings offer a new perspective on LLM interpretability and open avenues for more targeted training strategies, potentially enabling more efficient and specialized LLMs. |
| Researcher Affiliation | Academia | Jie Shao, Jianxin Wu. National Key Laboratory for Novel Software Technology, Nanjing University, China; School of Artificial Intelligence, Nanjing University, China. EMAIL, EMAIL |
| Pseudocode | No | The paper only presents mathematical formulas for model components and descriptions of experimental methods, but no structured pseudocode or algorithm blocks are provided. |
| Open Source Code | No | The code associated with this paper will be released as open-source upon acceptance. |
| Open Datasets | Yes | On the AIME 2024 benchmark [19], the merged model M1 achieves level IV performance on several questions that model A cannot solve. As shown in Table 1, the merged model not only yields correct reasoning and answers, but also tends to generate longer and more detailed responses compared to A. |
| Dataset Splits | Yes | We adopt the pipeline of s1 [25] as our baseline, which uses the base model A = Qwen2.5-32B-Instruct and the dataset D = s1K containing 1,000 high-quality reasoning traces. The results are shown in Table 2, where our model F4 corresponds to model B in Assumption 3. |
| Hardware Specification | Yes | All visualization and inference experiments on 1.5B-14B models are conducted on a single NVIDIA A100 GPU. For training and evaluating 32B-70B models, we use a cluster of 8 NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions several software components like 'Transformers library', 'lm-evaluation-harness package', 'vLLM', and 'DeepSpeed' but does not provide specific version numbers for any of them. |
| Experiment Setup | Yes | For the Freeze Stethoscope experiments, we build on the codebase of s1 [25]. We use a learning rate of 1e-5, weight decay of 1e-4, a batch size of 16, and train for 5 epochs. |
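
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration object, which makes it easy to see what a rerun would need to match. The sketch below is illustrative only: the class name `FineTuneSetup` and the helper `total_optimizer_steps` are hypothetical, not taken from the paper or the s1 codebase, and the step count assumes one optimizer step per batch of 16 with the final partial batch kept. The example count of 1,000 comes from the s1K dataset size quoted in the Dataset Splits row.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneSetup:
    """Hyperparameters as quoted in the Experiment Setup row (hypothetical container)."""
    learning_rate: float = 1e-5
    weight_decay: float = 1e-4
    batch_size: int = 16
    epochs: int = 5


def total_optimizer_steps(num_examples: int, setup: FineTuneSetup) -> int:
    """Count optimizer steps, assuming one step per batch and a kept partial last batch."""
    steps_per_epoch = math.ceil(num_examples / setup.batch_size)
    return steps_per_epoch * setup.epochs


if __name__ == "__main__":
    setup = FineTuneSetup()
    # s1K contains 1,000 reasoning traces according to the Dataset Splits row.
    print(total_optimizer_steps(1000, setup))
```

Under these assumptions, 1,000 examples give 63 steps per epoch and 315 steps over 5 epochs. A different batching policy (for example, dropping the last partial batch or using gradient accumulation) would change the count, and the table does not say which one the authors used.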