Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Learning to Focus: Causal Attention Distillation via Gradient‐Guided Token Pruning

Authors: Yiju Guo, Wenkai Yang, Zexu Sun, Ning Ding, Zhiyuan Liu, Yankai Lin

NeurIPS 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Experimental results demonstrate that LeaF not only achieves an absolute improvement in various mathematical reasoning, code generation and multi-hop question answering benchmarks but also effectively suppresses attention to confounding tokens during inference, yielding a more interpretable and reliable reasoning model. |
| Researcher Affiliation | Collaboration | Gaoling School of Artificial Intelligence, Renmin University of China; Department of Computer Science and Technology, Tsinghua University; Baidu Inc. |
| Pseudocode | No | The paper describes methods verbally and with figures (e.g., Figure 4: Method Overview) but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured steps. |
| Open Source Code | Yes | Code and data are available at https://github.com/RUCBM/LeaF. |
| Open Datasets | Yes | For mathematical reasoning, to ensure the model encounters an equal number of confounding tokens across tasks, we randomly select 30k instances from each of the following subsets in NuminaMath-CoT [26]: Olympiads [16], AMC_AIME [26], GSM8K [8], and MATH [17]. For code generation, we randomly select a subset of 120k instances from the AceCode87k [53] dataset. For multi-hop question answering, we construct the training set by merging the KILT [40] datasets provided in Helmet [52], totaling 3k annotated samples drawn equally from HotpotQA [50], NQ [1], and PopQA [36], where each query is explicitly linked to its corresponding gold passages containing the answers. |
| Dataset Splits | Yes | For mathematical reasoning, to ensure the model encounters an equal number of confounding tokens across tasks, we randomly select 30k instances from each of the following subsets in NuminaMath-CoT [26]: Olympiads [16], AMC_AIME [26], GSM8K [8], and MATH [17]. For code generation, we randomly select a subset of 120k instances from the AceCode87k [53] dataset. For multi-hop question answering, we construct the training set by merging the KILT [40] datasets provided in Helmet [52], totaling 3k annotated samples drawn equally from HotpotQA [50], NQ [1], and PopQA [36]... Validation Set Size (Math): 1035; Validation Set Size (Code): 2000 |
| Hardware Specification | Yes | Gradient difference computation in LeaF is a one-time offline process on 8 NVIDIA A100 (80GB) GPUs, jointly computing gradients from the teacher and student models. ... Counterfactual responses are generated offline using vLLM on 4 NVIDIA A100 (80GB) GPUs. ... We further measure the end-to-end training overhead over 3 epochs on 4 NVIDIA A100 (80GB) GPUs. |
| Software Dependencies | No | The paper mentions using the 'Alpaca-LoRA framework' and 'full-parameter logits knowledge distillation' along with a 'cosine learning rate schedule', but does not provide specific version numbers for these frameworks, libraries, or any underlying software like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | models are trained using the Alpaca-LoRA framework with full-parameter logits knowledge distillation and a cosine learning rate schedule with a maximum learning rate of 10^-5 for three epochs. The batch size is 64 for LLaMA-based models and 32 for Qwen-based models. Detailed hyperparameters and platform information are in Appendix J. ... Table 10: Training hyper-parameters in Knowledge Distillation. ... LR: 1e-5; LR Scheduler: cosine; Batch Size: 64; Epochs: 3; Maximum Sequence Length: 4096; Warmup Steps: 5; Distill Loss Type: KL |
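
The Experiment Setup entry above quotes a cosine learning rate schedule (maximum 1e-5, 5 warmup steps, 3 epochs, batch size 64) and a KL distillation loss. The following is a minimal sketch of what those settings could look like in plain Python, not the authors' implementation. Several details are assumptions that the quoted text does not state: the linear warmup shape, decay to zero, the KL direction (teacher relative to student), and the function names `cosine_lr`, `softmax`, and `kl_distill_loss`, which are hypothetical. The 120k training-set size used for the step count follows from the quoted 30k instances drawn from each of four math subsets.

```python
import math

# Values quoted in the Experiment Setup entry (Table 10 of the paper).
MAX_LR = 1e-5
WARMUP_STEPS = 5
EPOCHS = 3
BATCH_SIZE = 64  # reported for LLaMA-based models; 32 for Qwen-based models


def cosine_lr(step, total_steps, max_lr=MAX_LR, warmup_steps=WARMUP_STEPS):
    """Learning rate at a given step.

    Assumption: linear warmup to max_lr, then cosine decay to zero.
    The source only says "cosine" with 5 warmup steps.
    """
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_distill_loss(teacher_logits, student_logits):
    """KL(teacher || student) for one token's vocabulary distribution.

    Assumption: the source lists the loss type as "KL" without
    specifying the direction or any temperature.
    """
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Step count for a 120k-instance training set (4 subsets x 30k for math).
steps_per_epoch = math.ceil(120_000 / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
```

Under these assumptions, `cosine_lr(step, total_steps)` rises to the maximum rate by the fifth step and then decays over the remaining updates, and `kl_distill_loss` is zero only when the student's distribution matches the teacher's.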