Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Residual Stream Analysis of Overfitting And Structural Disruptions

Authors: Quan Liu, Han Zhou, Wenquan Wu, Hua Wu, Sen Su

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We first quantify this effect, showing that safety data exhibits substantially lower token entropy (H1 9.18) and 2-gram diversity (0.048) compared to general instruction data (H1 12.05, 2-gram 0.205). To uncover the root cause, we introduce Flow Lens, a stable PCA-based tool for residual-stream geometry analysis, and reveal that higher proportions of safety examples concentrate variance along a few components, reducing representational smoothness and driving false refusals (false refusal rate rises from 63% to 84% as safety data increases from 0% to 40%). Guided by these insights, we propose Variance Concentration Loss (VCL), an auxiliary regularizer that penalizes excessive variance concentration in mid-layer residuals. Empirical results demonstrate that VCL reduces false refusals by over 35 percentage points while maintaining or improving performance on general benchmarks such as MMLU and GSM8K.
Researcher Affiliation	Collaboration	Quan Liu BUPT EMAIL Han Zhou Baidu EMAIL Wenquan Wu Baidu EMAIL Hua Wu Baidu EMAIL Sen Su BUPT EMAIL
Pseudocode	No	The paper describes methods in prose and mathematical formulas (e.g., Ltotal = LSFT + γ LVCL in Section 5) but does not include a structured pseudocode or algorithm block.
Open Source Code	Yes	We provide the source code of at the anonymous link https://anonymous.4open.science/r/Code For Paper-3454
Open Datasets	Yes	We use three datasets in this study: WILDJAILBREAK, WILDGUARDMIX, and TULU-MIX, each containing approximately 100,000 safety-aligned completions. We additionally sample 100,000 nonsafety examples from the TULU-3-SFT-MIXTURE-GENERAL dataset as a control group. Appendix B provides additional information about each dataset used in our study, including how completions are constructed, filtered, and organized. WILDJAILBREAK. [...] HF/Wild Jailbreak, WILDGUARDMIX. [...] HF/Wild Guard Mix, TULU-3-SFT-MIXTURE. [...] HF/Tulu-3-SFT-Mixture.
Dataset Splits	Yes	False refusals are evaluated on the XSTest benchmark, following the evaluation protocol and decontamination procedure used in the Tülu 3 recipe. Evaluation is conducted on safety benchmarks and general capability tasks using Tülu 3 Evaluation Suite [17]. For safety evaluation, we include DAN, HarmBench, ToxiGen, WildGuard, JBB, and XSTest. For general capabilities, we report performance on MMLU, GSM8K, BBH, and Codex Eval.
Hardware Specification	Yes	All experiments were conducted on 8 NVIDIA A100-80G GPUs.
Software Dependencies	No	The paper mentions various models and tuning recipes but does not provide specific version numbers for software libraries, programming languages, or environments used for implementation.
Experiment Setup	Yes	All models are evaluated under identical decoding settings (greedy decoding, no temperature, max length 512), and results are averaged across tasks in each benchmark category. Specifically, we vary k in {1, 2, 4, 8}, γ in {0.01, 0.1, 1.0, 2.0} 50, and (l1, l2) corresponding to depths [0.1, 0.3], [0.3, 0.5], and [0.5, 0.7]. γ = 1.0 50 yields the best overall performance. Among the residual-window settings, selecting (l1, l2) to correspond to depth [0.3, 0.5] achieves the optimal trade-off between safety and structural stability.
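
The Research Type row above quotes token entropy (H1) and 2-gram diversity figures for the safety and general instruction data. The excerpts do not say how the paper computes either quantity, so the following is only a minimal sketch under two stated assumptions: that H1 is the Shannon entropy (in bits) of the empirical unigram token distribution, and that 2-gram diversity is the ratio of distinct 2-grams to total 2-grams. The function names `unigram_entropy` and `distinct_ngram_ratio` are hypothetical and do not come from the paper or its code release.

```python
import math
from collections import Counter


def unigram_entropy(tokens):
    """Shannon entropy in bits of the empirical unigram distribution.

    Assumption: this is what "token entropy (H1)" refers to; the paper's
    exact definition and tokenizer are not given in the excerpt.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def distinct_ngram_ratio(tokens, n=2):
    """Number of distinct n-grams divided by the total number of n-grams.

    Assumption: this is what "2-gram diversity" refers to when n=2.
    Returns 0.0 when the sequence is too short to contain any n-gram.
    """
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


if __name__ == "__main__":
    # Toy whitespace-tokenized text, purely illustrative.
    sample = "i cannot help with that i cannot help with this".split()
    print("H1 (bits):", unigram_entropy(sample))
    print("2-gram diversity:", distinct_ngram_ratio(sample, n=2))
```

Under these definitions, text built from a small set of repeated phrasings scores lower on both measures than varied text, which is the direction of the gap the quoted figures describe.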