Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Residual Stream Analysis of Overfitting And Structural Disruptions
Authors: Quan Liu, Han Zhou, Wenquan Wu, Hua Wu, Sen Su
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We first quantify this effect, showing that safety data exhibits substantially lower token entropy (H1 9.18) and 2-gram diversity (0.048) compared to general instruction data (H1 12.05, 2-gram 0.205). To uncover the root cause, we introduce Flow Lens, a stable PCA-based tool for residual-stream geometry analysis, and reveal that higher proportions of safety examples concentrate variance along a few components, reducing representational smoothness and driving false refusals (false refusal rate rises from 63% to 84% as safety data increases from 0% to 40%). Guided by these insights, we propose Variance Concentration Loss (VCL), an auxiliary regularizer that penalizes excessive variance concentration in mid-layer residuals. Empirical results demonstrate that VCL reduces false refusals by over 35 percentage points while maintaining or improving performance on general benchmarks such as MMLU and GSM8K. |
| Researcher Affiliation | Collaboration | Quan Liu BUPT EMAIL Han Zhou Baidu EMAIL Wenquan Wu Baidu EMAIL Hua Wu Baidu EMAIL Sen Su BUPT EMAIL |
| Pseudocode | No | The paper describes methods in prose and mathematical formulas (e.g., Ltotal = LSFT + γ LVCL in Section 5) but does not include a structured pseudocode or algorithm block. |
| Open Source Code | Yes | We provide the source code at the anonymous link https://anonymous.4open.science/r/Code For Paper-3454 |
| Open Datasets | Yes | We use three datasets in this study: WILDJAILBREAK, WILDGUARDMIX, and TULU-MIX, each containing approximately 100,000 safety-aligned completions. We additionally sample 100,000 nonsafety examples from the TULU-3-SFT-MIXTURE-GENERAL dataset as a control group. Appendix B provides additional information about each dataset used in our study, including how completions are constructed, filtered, and organized. WILDJAILBREAK. [...] HF/Wild Jailbreak, WILDGUARDMIX. [...] HF/Wild Guard Mix, TULU-3-SFT-MIXTURE. [...] HF/Tulu-3-SFT-Mixture. |
| Dataset Splits | Yes | False refusals are evaluated on the XSTest benchmark, following the evaluation protocol and decontamination procedure used in the Tülu 3 recipe. Evaluation is conducted on safety benchmarks and general capability tasks using Tülu 3 Evaluation Suite [17]. For safety evaluation, we include DAN, HarmBench, ToxiGen, WildGuard, JBB, and XSTest. For general capabilities, we report performance on MMLU, GSM8K, BBH, and CodexEval. |
| Hardware Specification | Yes | All experiments were conducted on 8 NVIDIA A100-80G GPUs. |
| Software Dependencies | No | The paper mentions various models and tuning recipes but does not provide specific version numbers for software libraries, programming languages, or environments used for implementation. |
| Experiment Setup | Yes | All models are evaluated under identical decoding settings (greedy decoding, no temperature, max length 512), and results are averaged across tasks in each benchmark category. Specifically, we vary k in {1, 2, 4, 8}, γ in {0.01, 0.1, 1.0, 2.0} 50, and (l1, l2) corresponding to depths [0.1, 0.3], [0.3, 0.5], and [0.5, 0.7]. γ = 1.0 50 yields the best overall performance. Among the residual-window settings, selecting (l1, l2) to correspond to depth [0.3, 0.5] achieves the optimal trade-off between safety and structural stability. |
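
The token-entropy (H1) and 2-gram diversity figures quoted in the Research Type row can be illustrated with a small sketch. It assumes that H1 is the Shannon entropy of the unigram distribution in bits and that 2-gram diversity is the ratio of distinct to total bigrams; the summary does not confirm either definition. The whitespace tokenization and both toy strings are hypothetical and are not taken from the paper's datasets.

```python
import math
from collections import Counter


def unigram_entropy_bits(tokens):
    """Shannon entropy, in bits, of the empirical unigram distribution.

    Assumption: this is what "H1" refers to; the paper may define it differently.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def distinct_ngram_ratio(tokens, n=2):
    """Number of distinct n-grams divided by the total number of n-grams.

    Assumption: this matches the "2-gram diversity" reported in the table.
    """
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


# Hypothetical toy inputs: a templated refusal versus more varied text.
repetitive = "i cannot help with that i cannot help with that".split()
varied = "the quick brown fox jumps over the lazy dog today".split()

for name, toks in [("repetitive", repetitive), ("varied", varied)]:
    h1 = unigram_entropy_bits(toks)
    d2 = distinct_ngram_ratio(toks, n=2)
    print(f"{name}: H1={h1:.3f} bits, 2-gram diversity={d2:.3f}")
```

On these toy inputs the templated text scores lower on both measures, which is the direction of the gap the paper reports between safety data and general instruction data. The absolute values are not comparable to the paper's, which are computed over roughly 100,000 completions with a real tokenizer.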