Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Rethinking Residual Distribution in Locate-then-Edit Model Editing

Authors: Xiaopeng Li, Shangwen Wang, Shasha Li, Shezheng Song, Bin Ji, Jun Ma, Jie Yu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Sequential batch editing experiments on three LLMs and two datasets demonstrate that BLUE not only delivers an average performance improvement of 35.59%, significantly advancing the state of the art in model editing, but also enhances the preservation of LLMs' general capabilities.
Researcher Affiliation	Academia	Xiaopeng Li, Shangwen Wang, Shasha Li, Shezheng Song, Bin Ji, Jun Ma, Jie Yu; National University of Defence Technology
Pseudocode	No	The paper describes the proposed BLUE strategy and existing methods textually and through diagrams (e.g., Figure 1), but does not contain a dedicated pseudocode or algorithm block.
Open Source Code	Yes	Our code is available at https://github.com/xpq-tech/BLUE.
Open Datasets	Yes	Our experiments are conducted on two datasets: CounterFact [4] and zsRE [26].
Dataset Splits	Yes	We randomly sample 2,000 samples from the dataset and perform sequential batch editing with a batch size of 100. Unless otherwise specified, we use the first 200 samples from the CounterFact dataset.
Hardware Specification	Yes	All our experiments are conducted on A800 GPUs.
Software Dependencies	No	The paper mentions using Large Language Models (LLMs) like Llama3-8B-Instruct, GPT-J (6B), and GPT2-XL, and references their original papers, but does not explicitly state the specific version numbers of software libraries (e.g., PyTorch, TensorFlow, Python) or other dependencies used for implementation.
Experiment Setup	Yes	The critical layers analyzed for each model are: Llama3-8B: {4, 5, 6, 7, 8}, GPT-J (6B): {3, 4, 5, 6, 7, 8}, and GPT2-XL: {13, 14, 15, 16, 17}. Also, for AlphaEdit BLUE, we set the α values for Llama3 (8B), GPT-J (6B), and GPT2-XL to 1, 95, and 80, respectively... We randomly sample 2,000 samples from the dataset and perform sequential batch editing with a batch size of 100.
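
The Dataset Splits and Experiment Setup rows can be read together as a small configuration plus a batching scheme. The sketch below is only an illustration of that reading, not the authors' code: the layer sets, α values, sample count, and batch size are taken from the quoted text, while the dictionary layout, the function name `sequential_batches`, the `dataset_size` argument, and the random seed are assumptions introduced here (the quoted text does not report a seed or a dataset size).

```python
import random

# Values quoted in the Experiment Setup row; the dictionary layout is an assumption.
CRITICAL_LAYERS = {
    "Llama3-8B": [4, 5, 6, 7, 8],
    "GPT-J (6B)": [3, 4, 5, 6, 7, 8],
    "GPT2-XL": [13, 14, 15, 16, 17],
}
ALPHAEDIT_BLUE_ALPHA = {"Llama3-8B": 1, "GPT-J (6B)": 95, "GPT2-XL": 80}

NUM_EDITS = 2000   # "randomly sample 2,000 samples"
BATCH_SIZE = 100   # "batch size of 100"


def sequential_batches(dataset_size, num_edits=NUM_EDITS,
                       batch_size=BATCH_SIZE, seed=0):
    """Sample `num_edits` distinct indices and split them into ordered batches.

    `dataset_size` and `seed` are hypothetical parameters, not reported values.
    """
    rng = random.Random(seed)
    indices = rng.sample(range(dataset_size), num_edits)  # without replacement
    return [indices[i:i + batch_size]
            for i in range(0, num_edits, batch_size)]


# Placeholder dataset size, chosen only so the example runs.
batches = sequential_batches(dataset_size=20000)
print(len(batches), len(batches[0]))  # 20 100
```

Under these settings, one run consists of 20 sequential editing steps of 100 edits each, applied to the same model in order.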