Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Towards Lifelong Model Editing via Simulating Ideal Editor
Authors: Yaming Guo, Siyang Guo, Hengshu Zhu, Ying Sun
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments validate the effectiveness of SimIE, which allows standard algorithms to achieve performance comparable to specialized lifelong model editing methods. Our code is available at SimIE. [...] We conduct experiments on three widely used LLMs: GPT2-XL (1.5B) (Radford et al., 2019), Llama-2 (7B) (Touvron et al., 2023), and Mistral (7B) (Chaplot, 2023). Our experiments include nine popular baselines: the basic fine-tuning method, FT-L (Meng et al., 2022), and four standard model editing algorithms, namely MEND (Mitchell et al., 2022a), ROME (Meng et al., 2022), MEMIT (Meng et al., 2023), and AlphaEdit (Fang et al., 2024), along with four lifelong model editing algorithms, specifically GRACE (Hartvigsen et al., 2024), WISE (Wang et al., 2024b), PRUNE (Ma et al., 2024), and AlphaEdit (Fang et al., 2024). These algorithms are evaluated using two widely adopted benchmarks, i.e., the ZsRE dataset (Levy et al., 2017) and the Counterfact dataset (Meng et al., 2022). |
| Researcher Affiliation | Academia | ¹Artificial Intelligence Thrust, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China; ²School of Intelligent Systems Engineering, Sun Yat-sen University, Shenzhen, China; ³Computer Network Information Center, Chinese Academy of Sciences, Beijing, China; ⁴University of Chinese Academy of Sciences, Beijing, China. Correspondence to: Ying Sun <EMAIL>. |
| Pseudocode | Yes | The pseudo-code for SimIE is summarized in Algorithm 1. |
| Open Source Code | No | Our code is available at SimIE. |
| Open Datasets | Yes | These algorithms are evaluated using two widely adopted benchmarks, i.e., the ZsRE dataset (Levy et al., 2017) and the Counterfact dataset (Meng et al., 2022). |
| Dataset Splits | Yes | In line with prior research (Wang et al., 2024b), we assess performance using three key metrics: Rel (Reliability, also known as Edit Success Rate (Hartvigsen et al., 2024)), Gen (Generalization Success Rate), and Loc (Localization Success Rate). We use the Arithmetic Mean Avg = (Rel + Gen + Loc) / 3 as the primary metric, and introduce the Locality-penalized Geometric Mean Geo = e^(α(Loc−1)) · √(Rel·Gen) as a complementary measure. For more details on the experimental setup, please refer to Appendix D.1. [...] We adopt the train/test split from previous work (Wang et al., 2024b; Meng et al., 2022). Except for MEND, which uses the training set to fit the hypernetwork, all other methods perform editing and evaluation directly on the test set. |
| Hardware Specification | No | The paper discusses various LLMs like GPT2-XL, Llama-2, Mistral, Llama-3, and Qwen2.5 but does not provide any specific details about the hardware (e.g., GPU models, CPU types) used for running the experiments. It only mentions 'To reduce VRAM usage' in Appendix D.3.3 without specifying the hardware providing that VRAM. |
| Software Dependencies | No | The paper mentions using the knowledge editing framework EasyEdit (Zhang et al., 2024b) and various LLMs (GPT2-XL, Llama-2, Mistral, Llama-3, Qwen2.5), but it does not specify the version numbers for EasyEdit or any other software libraries or frameworks used (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | Specifically, we perform T = 1000 sequential edits on LLMs, with 1 example per edit. [...] For the hyperparameter λ, we conduct a grid search over the set {0.01, 0.1, 1, 5, 10, 30, 50}, with the resulting values summarized in Table 3. [...] We perform 600 sequential edits on GPT2-XL. ROME, AlphaEdit (standard methods), PRUNE, and AlphaEdit (lifelong methods) are selected as baselines for their superior performance on GPT2-XL. |
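The two aggregate metrics quoted above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names are ours, the Geo formula is reconstructed from the excerpt as e^(α(Loc−1)) · √(Rel·Gen) (a geometric mean of Rel and Gen, penalized whenever Loc falls below 1), and the value of the hyperparameter α is assumed, not taken from the paper.

```python
import math

def avg_score(rel: float, gen: float, loc: float) -> float:
    """Arithmetic mean of the three editing metrics (primary metric)."""
    return (rel + gen + loc) / 3

def geo_score(rel: float, gen: float, loc: float, alpha: float = 1.0) -> float:
    """Locality-penalized geometric mean (complementary metric).

    The penalty factor e^(alpha * (loc - 1)) equals 1 at perfect locality
    (loc == 1) and shrinks toward 0 as locality degrades.
    """
    return math.exp(alpha * (loc - 1)) * math.sqrt(rel * gen)

# With perfect locality, Geo reduces to the plain geometric mean of Rel and Gen.
print(avg_score(0.9, 0.8, 1.0))  # (0.9 + 0.8 + 1.0) / 3 = 0.9
print(geo_score(0.9, 0.8, 1.0))  # sqrt(0.72) ≈ 0.8485
```

Unlike the arithmetic mean, this Geo form cannot be inflated by trading locality away: a drop in Loc multiplies the whole score down exponentially rather than being offset linearly by gains in Rel or Gen.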