Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Learning to Focus: Causal Attention Distillation via Gradient‐Guided Token Pruning

Authors: Yiju Guo, Wenkai Yang, Zexu Sun, Ning Ding, Zhiyuan Liu, Yankai Lin

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experimental results demonstrate that LeaF not only achieves an absolute improvement in various mathematical reasoning, code generation and multi-hop question answering benchmarks but also effectively suppresses attention to confounding tokens during inference, yielding a more interpretable and reliable reasoning model.
Researcher Affiliation	Collaboration	Gaoling School of Artificial Intelligence, Renmin University of China Department of Computer Science and Technology, Tsinghua University Baidu Inc.
Pseudocode	No	The paper describes methods verbally and with figures (e.g., Figure 4: Method Overview) but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured steps.
Open Source Code	Yes	Code and data are available at https://github.com/RUCBM/LeaF.
Open Datasets	Yes	For mathematical reasoning, to ensure the model encounters an equal number of confounding tokens across tasks, we randomly select 30k instances from each of the following subsets in NuminaMath-CoT [26]: Olympiads [16], AMC_AIME [26], GSM8K [8], and MATH [17]. For code generation, we randomly select a subset of 120k instances from the AceCode87k [53] dataset. For multi-hop question answering, we construct the training set by merging the KILT [40] datasets provided in Helmet [52], totaling 3k annotated samples drawn equally from HotpotQA [50], NQ [1], and PopQA [36], where each query is explicitly linked to its corresponding gold passages containing the answers.
Dataset Splits	Yes	For mathematical reasoning, to ensure the model encounters an equal number of confounding tokens across tasks, we randomly select 30k instances from each of the following subsets in NuminaMath-CoT [26]: Olympiads [16], AMC_AIME [26], GSM8K [8], and MATH [17]. For code generation, we randomly select a subset of 120k instances from the AceCode87k [53] dataset. For multi-hop question answering, we construct the training set by merging the KILT [40] datasets provided in Helmet [52], totaling 3k annotated samples drawn equally from HotpotQA [50], NQ [1], and PopQA [36]... Validation Set Size (Math): 1035; Validation Set Size (Code): 2000
Hardware Specification	Yes	Gradient difference computation in LeaF is a one-time offline process on 8 NVIDIA A100 (80GB) GPUs, jointly computing gradients from the teacher and student models. ... Counterfactual responses are generated offline using vLLM on 4 NVIDIA A100 (80GB) GPUs. ... We further measure the end-to-end training overhead over 3 epochs on 4 NVIDIA A100 (80GB) GPUs.
Software Dependencies	No	The paper mentions using the 'Alpaca-LoRA framework' and 'full-parameter logits knowledge distillation' along with a 'cosine learning rate schedule', but does not provide specific version numbers for these frameworks, libraries, or any underlying software like Python, PyTorch, or CUDA.
Experiment Setup	Yes	models are trained using the Alpaca-LoRA framework with full-parameter logits knowledge distillation and a cosine learning rate schedule with a maximum learning rate of 10^-5 for three epochs. The batch size is 64 for LLaMA-based models and 32 for Qwen-based models. Detailed hyperparameters and platform information are in Appendix J. ... Table 10: Training hyper-parameters in Knowledge Distillation. ... LR: 1e-5; LR Scheduler: cosine; Batch Size: 64; Epochs: 3; Maximum Sequence Length: 4096; Warmup Steps: 5; Distill Loss Type: KL
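
To make the Experiment Setup row concrete, the following is a minimal sketch of a cosine learning rate schedule with warmup and a KL distillation loss, using the quoted values (maximum LR 1e-5, 5 warmup steps, KL loss). It is not the authors' implementation. The function names, the linear warmup shape, the decay to zero, the total step count, and the KL direction (teacher relative to student) are all assumptions made for illustration.

```python
import math

# Values quoted in the Experiment Setup row (Table 10 of the paper).
MAX_LR = 1e-5
WARMUP_STEPS = 5


def cosine_lr(step, total_steps, max_lr=MAX_LR, warmup_steps=WARMUP_STEPS):
    """Learning rate at a 0-indexed step.

    Assumptions: linear warmup to max_lr, then cosine decay to zero.
    """
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl_distill_loss(teacher_logits, student_logits):
    """KL(teacher || student) at one token position.

    Assumption: the direction of the divergence is not stated in the row above.
    """
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


if __name__ == "__main__":
    total = 100  # hypothetical number of optimizer steps
    for s in (0, 4, 5, 50, 100):
        print(s, cosine_lr(s, total))
    print(kl_distill_loss([2.0, 1.0, 0.1], [1.5, 1.2, 0.3]))
```

In an actual training run the loss would be averaged over all token positions and batch elements, and the schedule would normally come from the training framework rather than a hand-written function.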