Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Hippocampal-like Sequential Editing for Continual Knowledge Updates in Large Language Models
Authors: Quntian Fang, Zhen Huang, Zhiliang Tian, Minghao Hu, Dongsheng Li, Yiping Yao, Xinyue Fang, Menglong Lu, Guotong Geng
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results show that HSE significantly outperforms existing model editing methods across multiple benchmarks. Compared to the best baseline, our approach demonstrates average improvements of 20.6% in generalization, 21.9% in specificity and 17.3% in efficacy. In practical applications, experiments confirm its effectiveness in multi-domain hallucination mitigation, healthcare knowledge injecting, and societal bias reduction. |
| Researcher Affiliation | Academia | ¹National Key Laboratory of Parallel and Distributed Computing, ²College of System Engineering, ³Key Laboratory of Advanced Microprocessor Chips and Systems, National University of Defense Technology; ⁴Center of Information Research, AMS. EMAIL EMAIL |
| Pseudocode | Yes | The overall procedure of the HSE method is detailed in the Appx. B. B HSE Procedure: To facilitate the practical application of HSE, we present a detailed exposition of its specific algorithms and operational procedures. We illustrate the editing process using the most common scenario, one-by-one sequential editing. The process is illustrated in the algorithm 1: Algorithm 1 HSE algorithm |
| Open Source Code | Yes | Our code is available at HSE_code³ ... ³https://github.com/Square-Group-Sky/HSE |
| Open Datasets | Yes | CounterFact dataset [43] presents a challenging cloze task for model editing. ZsRE dataset [45] is a question-answering (QA) dataset designed to evaluate the performance of model editing. HalluEdit dataset [21] is a meticulously constructed benchmark specifically designed to assess the effectiveness of model editing in rectifying nonfactual information generated by LLMs. SafeEdit dataset [60] is a novel benchmark designed to investigate the detoxification of LLMs through model editing. GLUE benchmark [58] comprises six tasks designed to evaluate the general capabilities of natural language models: |
| Dataset Splits | Yes | We conduct one-by-one sequential editing with 1,000 samples experiments on four open-source LLMs respectively: Llama3-Instruct (8B), Mistral7B-Instruct-V0.3, GPT-J (6B), and GPT2-XL (1.5B). GLUE Metrics. GLUE employs the F1 score as a unified evaluation metric. For more detailed information, please refer to [58]. |
| Hardware Specification | Yes | We conduct experiments on an A100 80GB GPU. |
| Software Dependencies | No | The paper does not explicitly provide specific version numbers for software dependencies such as libraries or frameworks. |
| Experiment Setup | Yes | Llama3-8B-Instruct, Llama3-Aloe-8B-Alpha and OpenBioLLM-8B apply editing to layers [4,5,6,7,8]. Specifically, the update norm of δ is constrained to 0.75 times the norm of the original output representation to ensure controlled modifications. The iterative process for updating δ is capped at a maximum of 25 steps, with a learning rate 1e-1. To manage the trade-off between retaining previous knowledge and incorporating new information, we set the memory factor in Eq. 6 α to 0.8. Additionally, the Fisher information matrix coefficient hyperparameter in Eq. 16 λ_i is configured to 1e-1, while the hyperparameter λ_C of C_0 is set to 15,000. |
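
The values quoted in the Experiment Setup row can be collected into a small configuration object, which may help when trying to reproduce the reported settings. The sketch below is illustrative only: the class name, field names, and the `clamp_update` helper are hypothetical and are not taken from the HSE repository. The helper assumes the norm constraint on δ is enforced by rescaling δ whenever its L2 norm exceeds 0.75 times the norm of the original output representation; the paper excerpt does not say how the constraint is applied.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class EditSetup:
    """Hyperparameters quoted in the Experiment Setup row (field names are hypothetical)."""
    edit_layers: tuple = (4, 5, 6, 7, 8)  # layers edited for the Llama3-8B-based models
    clamp_norm_factor: float = 0.75       # max ||delta|| relative to ||original output||
    max_delta_steps: int = 25             # cap on iterative updates of delta
    delta_lr: float = 1e-1                # learning rate for updating delta
    memory_factor_alpha: float = 0.8      # alpha in Eq. 6
    fisher_coeff_lambda: float = 1e-1     # lambda_i in Eq. 16
    lambda_c: float = 15_000.0            # lambda_C of C_0


def l2_norm(vec):
    """Euclidean norm of a sequence of floats."""
    return math.sqrt(sum(x * x for x in vec))


def clamp_update(delta, original, factor):
    """Rescale delta so that ||delta|| <= factor * ||original||.

    Assumption: the constraint is enforced by rescaling; the excerpt only
    states that the update norm is constrained.
    """
    max_norm = factor * l2_norm(original)
    delta_norm = l2_norm(delta)
    if delta_norm == 0.0 or delta_norm <= max_norm:
        return list(delta)
    scale = max_norm / delta_norm
    return [x * scale for x in delta]


setup = EditSetup()
clamped = clamp_update([6.0, 8.0], [3.0, 4.0], setup.clamp_norm_factor)
```

Here the original representation has norm 5, so the update is limited to norm 3.75; an update that is already within the limit is returned unchanged.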