Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Reinforced Lifelong Editing for Language Models

Authors: Zherui Li, Houcheng Jiang, Hao Chen, Baolong Bi, Zhenhong Zhou, Fei Sun, Junfeng Fang, Xiang Wang

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Our extensive empirical evaluation across several LLMs demonstrates that RLEdit outperforms existing methods in lifelong editing with superior effectiveness and efficiency, achieving a 59.24% improvement while requiring only 2.11% of the time compared to most approaches. Our code is available at: https://github.com/zhrli324/RLEdit. ... We conduct extensive experiments to evaluate both the effectiveness and efficiency of our approach. Additionally, we perform ablation studies to analyze the contribution of each component in RLEdit, which can be found in Appendix B.1.
Researcher Affiliation Academia 1Beijing University of Posts and Telecommunications 2University of Science and Technology of China 3Institute of Computing Technology, Chinese Academy of Sciences 4National University of Singapore.
Pseudocode Yes The pseudo-code is provided in Algorithm 1. ... Algorithm 1 RLEdit Hypernetwork Training ... The corresponding pseudocode for RLEdit's editing algorithms is presented in Algorithm 2.
Open Source Code Yes Our code is available at: https://github.com/zhrli324/RLEdit.
Open Datasets Yes We evaluate RLEdit on three widely-used datasets: ZsRE (Levy et al., 2017), FEVER (Thorne et al., 2018), and CounterFact (Meng et al., 2022). Following previous evaluation standards (Mitchell et al., 2022a; Meng et al., 2022; 2023)
Dataset Splits Yes We randomly sampled 8,000 knowledge samples from ZsRE and FEVER respectively, performing edits over 400 batches with 20 knowledge samples per batch (denoted as a 400 × 20 configuration throughout this paper). ... For locate-then-edit methods, we use the version from MEMIT; for hypernetwork-based methods, we use the version from MEND, where ZsRE is divided into training and test sets for hypernetwork training and editing performance evaluation respectively.
Hardware Specification Yes Most experiments were conducted on a single NVIDIA A100 (80GB) GPU.
Software Dependencies No The paper does not provide specific version numbers for software dependencies such as programming languages or libraries.
Experiment Setup Yes For the hyperparameters in RLEdit training and editing, we set the memory backtracking decay factor µ to 0.95, the backtracking depth k to 10, the regularization coefficient η to 1e-4 and the discount factor γ to 1 in the total reward formula. Additionally, the initial learning rate was set to 1e-6, while the meta-learning rate was set to 1e-5. The specific hyperparameter configurations for different models and datasets are shown in Table 3.
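The hyperparameters quoted in the Experiment Setup row can be collected into a single configuration for reference. This is a minimal sketch: the key names below are illustrative and are not RLEdit's actual identifiers; only the numeric values come from the quoted paper text.

```python
# Hypothetical config names; values taken from the quoted Experiment Setup.
rledit_config = {
    "memory_backtracking_decay_mu": 0.95,  # µ: memory backtracking decay factor
    "backtracking_depth_k": 10,            # k: backtracking depth
    "regularization_coeff_eta": 1e-4,      # η: regularization coefficient
    "discount_factor_gamma": 1.0,          # γ: discount factor in the total reward
    "initial_learning_rate": 1e-6,         # initial (editing) learning rate
    "meta_learning_rate": 1e-5,            # hypernetwork meta-learning rate
}

# Basic sanity check that the values match the reported setup.
assert rledit_config["discount_factor_gamma"] == 1.0
print(rledit_config["backtracking_depth_k"])  # → 10
```

Per the paper's quoted text, these are global defaults; model- and dataset-specific overrides are listed in its Table 3.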