Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Hippocampal-like Sequential Editing for Continual Knowledge Updates in Large Language Models

Authors: Quntian Fang, Zhen Huang, Zhiliang Tian, Minghao Hu, Dongsheng Li, Yiping Yao, Xinyue Fang, Menglong Lu, Guotong Geng

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our experimental results show that HSE significantly outperforms existing model editing methods across multiple benchmarks. Compared to the best baseline, our approach demonstrates average improvements of 20.6% in generalization, 21.9% in specificity and 17.3% in efficacy. In practical applications, experiments confirm its effectiveness in multi-domain hallucination mitigation, healthcare knowledge injecting, and societal bias reduction.
Researcher Affiliation	Academia	(1) National Key Laboratory of Parallel and Distributed Computing; (2) College of System Engineering; (3) Key Laboratory of Advanced Microprocessor Chips and Systems, National University of Defense Technology; (4) Center of Information Research, AMS
Pseudocode	Yes	The overall procedure of the HSE method is detailed in Appx. B. ... B HSE Procedure: To facilitate the practical application of HSE, we present a detailed exposition of its specific algorithms and operational procedures. We illustrate the editing process using the most common scenario, one-by-one sequential editing. The process is illustrated in Algorithm 1. ... Algorithm 1: HSE algorithm
Open Source Code	Yes	Our code is available at HSE_code (footnote 3). ... Footnote 3: https://github.com/Square-Group-Sky/HSE
Open Datasets	Yes	Counterfact dataset [43] presents a challenging cloze task for model editing. ZsRE dataset [45] is a question-answering (QA) dataset designed to evaluate the performance of model editing. HalluEdit dataset [21] is a meticulously constructed benchmark specifically designed to assess the effectiveness of model editing in rectifying nonfactual information generated by LLMs. SafeEdit dataset [60] is a novel benchmark designed to investigate the detoxification of LLMs through model editing. GLUE benchmark [58] comprises six tasks designed to evaluate the general capabilities of natural language models:
Dataset Splits	Yes	We conduct one-by-one sequential editing with 1,000 samples experiments on four open-source LLMs respectively: Llama3-Instruct (8B), Mistral7B-Instruct-V0.3, GPT-J (6B), and GPT2-XL (1.5B). GLUE Metrics. GLUE employs the F1 score as a unified evaluation metric. For more detailed information, please refer to [58].
Hardware Specification	Yes	We conduct experiments on an A100 80GB GPU.
Software Dependencies	No	The paper does not explicitly provide specific version numbers for software dependencies such as libraries or frameworks.
Experiment Setup	Yes	Llama3-8B-Instruct, Llama3-Aloe-8B-Alpha and OpenBioLLM-8B apply editing to layers [4,5,6,7,8]. Specifically, the update norm of δ is constrained to 0.75 times the norm of the original output representation to ensure controlled modifications. The iterative process for updating δ is capped at a maximum of 25 steps, with a learning rate of 1e-1. To manage the trade-off between retaining previous knowledge and incorporating new information, we set the memory factor α in Eq. 6 to 0.8. Additionally, the Fisher information matrix coefficient hyperparameter λi in Eq. 16 is configured to 1e-1, while the hyperparameter λC of C0 is set to 15,000.
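
For readers who want to carry the reported settings into their own runs, the values quoted in the Experiment Setup row can be collected in a small configuration object. The sketch below is illustrative only and is not taken from the HSE repository. The class name, field names, and the helper `clamp_delta` are hypothetical. Enforcing the 0.75 norm constraint by rescaling δ is also an assumption, since the excerpt states only that the norm is constrained, not how.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class HSESetup:
    """Values quoted in the Experiment Setup row; field names are hypothetical."""
    edit_layers: Tuple[int, ...] = (4, 5, 6, 7, 8)  # layers edited for the 8B Llama3-based models
    clamp_norm_factor: float = 0.75   # max ||delta|| relative to the original output norm
    max_delta_steps: int = 25         # cap on iterative updates of delta
    delta_lr: float = 1e-1            # learning rate for updating delta
    memory_factor_alpha: float = 0.8  # alpha in Eq. 6
    fisher_coeff_lambda_i: float = 1e-1  # lambda_i in Eq. 16
    lambda_c: float = 15_000.0        # lambda_C of C0


def l2_norm(v: Sequence[float]) -> float:
    """Euclidean norm of a plain Python vector."""
    return math.sqrt(sum(x * x for x in v))


def clamp_delta(delta: Sequence[float],
                original_output: Sequence[float],
                factor: float) -> List[float]:
    """Rescale delta so that ||delta|| <= factor * ||original_output||.

    Assumption: the constraint is applied by rescaling; the source does not
    say how it is enforced.
    """
    max_norm = factor * l2_norm(original_output)
    norm = l2_norm(delta)
    if norm <= max_norm or norm == 0.0:
        return list(delta)  # already within the allowed norm
    scale = max_norm / norm
    return [x * scale for x in delta]


if __name__ == "__main__":
    cfg = HSESetup()
    clamped = clamp_delta([3.0, 4.0], [1.0, 0.0], cfg.clamp_norm_factor)
    print(cfg.edit_layers, l2_norm(clamped))
```

Keeping the settings in a frozen dataclass makes accidental changes between runs harder. Since the Software Dependencies row reports that library versions are not given, anyone using such a sketch would still need to pin the software environment separately.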