Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Rethinking Residual Distribution in Locate-then-Edit Model Editing
Authors: Xiaopeng Li, Shangwen Wang, Shasha Li, Shezheng Song, Bin Ji, Ma Jun, Jie Yu
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Sequential batch editing experiments on three LLMs and two datasets demonstrate that BLUE not only delivers an average performance improvement of 35.59%, significantly advancing the state of the art in model editing, but also enhances the preservation of LLMs' general capabilities. |
| Researcher Affiliation | Academia | Xiaopeng Li, Shangwen Wang, Shasha Li, Shezheng Song, Bin Ji, Jun Ma, Jie Yu; National University of Defence Technology; EMAIL |
| Pseudocode | No | The paper describes the proposed BLUE strategy and existing methods textually and through diagrams (e.g., Figure 1), but does not contain a dedicated pseudocode or algorithm block. |
| Open Source Code | Yes | Our code is available at https://github.com/xpq-tech/BLUE. |
| Open Datasets | Yes | Our experiments are conducted on two datasets: CounterFact [4] and zsRE [26]. |
| Dataset Splits | Yes | We randomly sample 2,000 samples from the dataset and perform sequential batch editing with a batch size of 100. Unless otherwise specified, we use the first 200 samples from the CounterFact dataset. |
| Hardware Specification | Yes | All our experiments are conducted on A800 GPUs. |
| Software Dependencies | No | The paper mentions using Large Language Models (LLMs) like Llama3-8B-Instruct, GPT-J (6B), and GPT2-XL, and references their original papers, but does not explicitly state the specific version numbers of software libraries (e.g., PyTorch, TensorFlow, Python) or other dependencies used for implementation. |
| Experiment Setup | Yes | The critical layers analyzed for each model are: Llama3-8B: {4, 5, 6, 7, 8}, GPT-J (6B): {3, 4, 5, 6, 7, 8} and GPT2-XL: {13, 14, 15, 16, 17}. Also, for AlphaEdit BLUE, we set the α values for Llama3 (8B), GPT-J (6B), and GPT2-XL to 1, 95, and 80, respectively... We randomly sample 2,000 samples from the dataset and perform sequential batch editing with a batch size of 100. |
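
The sampling and batching protocol quoted in the Dataset Splits and Experiment Setup rows can be sketched as follows. This is only an illustrative sketch, not the authors' code: the names `CRITICAL_LAYERS` and `sequential_batches`, the `seed` parameter, and the stand-in `records` list are all assumptions introduced here. Only the layer indices, the sample count of 2,000, and the batch size of 100 come from the table.

```python
import random

# Layer indices as quoted in the Experiment Setup row.
# The dictionary layout is an assumption made for illustration.
CRITICAL_LAYERS = {
    "Llama3-8B": [4, 5, 6, 7, 8],
    "GPT-J (6B)": [3, 4, 5, 6, 7, 8],
    "GPT2-XL": [13, 14, 15, 16, 17],
}

NUM_SAMPLES = 2000  # "randomly sample 2,000 samples"
BATCH_SIZE = 100    # "batch size of 100"


def sequential_batches(dataset, num_samples=NUM_SAMPLES,
                       batch_size=BATCH_SIZE, seed=0):
    """Yield consecutive batches drawn from a random sample of the dataset.

    The seed is hypothetical; the table does not report one.
    """
    rng = random.Random(seed)
    sampled = rng.sample(dataset, num_samples)  # sampling without replacement
    for start in range(0, num_samples, batch_size):
        yield sampled[start:start + batch_size]


# Stand-in for edit requests; a real run would load CounterFact or zsRE records.
records = list(range(5000))
batches = list(sequential_batches(records))
```

With these settings each run consists of 2,000 / 100 = 20 sequential edit batches, which is the quantity that matters when comparing against other sequential editing setups.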