Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Reliable Lifelong Multimodal Editing: Conflict-Aware Retrieval Meets Multi-Level Guidance

Authors: Qiang Zhang, Fanrui Zhang, Jiawei Liu, Ming Hu, Junjun He, Zheng-Jun Zha

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	4 Experiments 4.1 Experimental Setup Datasets and Models. Following [7], our experiments are conducted on the MMEdit benchmark [7], which includes two sub-tasks: Editing VQA (E-VQA) and Editing Image Caption (E-IC). Additionally, we incorporated VLKEB [17] dataset, which consists of real images to better represent real-world scenarios. 4.2 Main Results Competitive Performance of CARML. Tables 1, 2 and 3 present the results of lifelong editing experiments conducted on the MMEdit and VLKEB datasets for CARML and baseline methods. The experimental results reveal the following findings: 4.3 Further Analysis Effect of Individual Components. To analyze the efficacy of the core components within CARML, we conduct a detailed ablation study with results presented in Table 5.
Researcher Affiliation	Collaboration	Qiang Zhang1, Fanrui Zhang1,2, Jiawei Liu1, Ming Hu3, Junjun He3, Zheng-Jun Zha1 1MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, USTC 2Shanghai Innovation Institute 3Shanghai Artificial Intelligence Laboratory
Pseudocode	Yes	C Pseudo Code of CARML The pseudo-code of the CARML editing stage is in Algorithm 1, and the one of the CARML inference stage is Algorithm 2.
Open Source Code	No	Justification: This paper uses publicly available datasets. And we will release the source code as soon as the paper is accepted.
Open Datasets	Yes	To evaluate the effectiveness of our proposed CARML, we conduct experiments on three multimodal knowledge editing datasets: E-VQA [7], E-IC [7], and VLKEB [17]. E-VQA and E-IC are part of the MMEdit benchmark, while VLKEB includes real-world images to simulate practical scenarios. E-VQA [7]: The E-VQA dataset is derived from the VQAv2 [13] dataset E-IC [7]: The E-IC dataset is constructed from the COCO Caption [5] dataset VLKEB [17]: The Vision-Language Knowledge Editing Benchmark (VLKEB) is a large-scale dataset designed to evaluate the knowledge editing capabilities of MLLMs in realistic scenarios.
Dataset Splits	Yes	Regarding the data split of all three datasets, we followed the setup in the original dataset. The details of each dataset are as follows: E-VQA [7]: ... The dataset comprises 6,346 training samples and 2,093 testing samples... E-IC [7]: ... The dataset includes 2,849 training samples and 1,000 testing samples. VLKEB [17]: ... It contains 8,174 editing instances (5,000 for training and 3,174 for evaluation) and over 18,000 images...
Hardware Specification	Yes	All experiments were conducted on a single NVIDIA H100 GPU.
Software Dependencies	No	The paper mentions "Adam optimizer" and "mexma-siglip2 [36] as the multimodal embedding model", but it does not specify version numbers for these or other software libraries/dependencies.
Experiment Setup	Yes	During training, we used the Adam optimizer with batch size of 8, learning rate of 1e-5, and set the number of epochs to 120. During testing, the threshold β for the edit scope classifier was set to 0.4 for experiments on the E-IC dataset, and 0.7 for the E-VQA and VLKEB datasets. In explicit guidance, the relevance threshold η was set to 6. To comprehensively evaluate the model's performance, we employed reliability, generality (including T-Generality and M-Generality), and locality (including T-Locality and M-Locality) accuracy as evaluation metrics. All experiments were conducted on a single NVIDIA H100 GPU.
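
For readers attempting a reproduction, the values quoted in the Dataset Splits and Experiment Setup rows can be collected into one configuration object. The following is a minimal sketch, not the authors' code (which the table marks as unreleased): the class name, field names, and the `train_fraction` helper are hypothetical, and only the numeric values come from the rows above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the Experiment Setup row; all field names are hypothetical."""

    optimizer: str = "Adam"
    batch_size: int = 8
    learning_rate: float = 1e-5
    epochs: int = 120
    # Relevance threshold (eta) used in explicit guidance.
    relevance_threshold_eta: float = 6
    # Edit scope classifier threshold (beta), which the paper sets per dataset.
    scope_threshold_beta: dict = field(
        default_factory=lambda: {"E-IC": 0.4, "E-VQA": 0.7, "VLKEB": 0.7}
    )
    hardware: str = "1x NVIDIA H100"


# (train, test) sample counts as quoted in the Dataset Splits row.
SPLITS = {
    "E-VQA": (6346, 2093),
    "E-IC": (2849, 1000),
    "VLKEB": (5000, 3174),
}


def train_fraction(dataset: str) -> float:
    """Share of a dataset's samples that fall in the training split."""
    train, test = SPLITS[dataset]
    return train / (train + test)
```

The VLKEB counts sum to the 8,174 editing instances quoted in the Dataset Splits row, which gives a quick consistency check on the reported split. Library versions are not part of this sketch because the Software Dependencies row records that the paper does not state them.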