Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
A$^3$E: Towards Compositional Model Editing
Authors: Hongming Piao, Hao Wang, Dapeng Wu, Ying Wei
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that A3E improves the composability by at least 22.45% without sacrificing the performance of non-compositional model editing. ... We conduct a systematic analysis of existing methods under CME, revealing three pivotal failure modes in composability, including knowledge loss, incorrect preceding and knowledge sinking. ... We conduct large-scale experiments with all instances of the PEAK-CF and PEAK-T datasets, covering four CME tasks: independent multi-answer composition (IMAC), independent multi-question composition (IMQC), multi-answer composition (MAC) and multi-question composition (MQC). |
| Researcher Affiliation | Academia | Hongming Piao, City University of Hong Kong; Hao Wang, City University of Hong Kong; Dapeng Oliver Wu, City University of Hong Kong; Ying Wei, Zhejiang University |
| Pseudocode | Yes | Please refer to Appendix D for the vector database and Alg. 1-2 for the complete editing progress. ... Algorithm 1 The Edit Training Stage ... Algorithm 2 The Edit Composing Stage |
| Open Source Code | Yes | Extensive experiments demonstrate that A3E improves the composability by at least 22.45% without sacrificing the performance of non-compositional model editing. The code is available at https://github.com/piaohongming/A3E. |
| Open Datasets | Yes | We utilize the PEAK-CF and PEAK-T datasets [33], which contain a large number of questions with multiple answers. ... [33] Jun-Yu Ma, Zhen-Hua Ling, Ningyu Zhang, and Jia-Chen Gu. Neighboring perturbations of knowledge editing on large language models. arXiv preprint arXiv:2401.17623, 2024. |
| Dataset Splits | Yes | Experimental settings. We conducted analytical experiments using the Llama3-8B model [32] on the PEAK-CF dataset [33] with 50 2-edit composition samples, ... (3) For multi-answer composition, the retained questions and their randomly selected c missing answers form a test instance, where c is the composition number. ... (4) For multi-question composition, we randomly selected c retained questions without repetition and one randomly chosen missing answer per question to form a test instance. |
| Hardware Specification | Yes | The experiments are conducted on a server with 8 NVIDIA RTX 5880 Ada GPUs. |
| Software Dependencies | No | The paper mentions LLMs like Llama3-8B [32] and Mistral-7B [35] but does not provide specific version numbers for software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | For A3E, we set the unmasked size k to 896 for PEAK-CF and to 448 for PEAK-T. We set the loss weight α to 8 and β to 1 to balance their utilities. We store the output of the 5-th down projection layer at the last subject token as K and hdb. We set the edited FFN layer to 31, the learning rate to 0.01, and train each edit for 50 epochs. To better evaluate the priority of different answers within the model, for all baselines and A3E, we assign a penalty of 10 to the answers that have already been generated. ... For GRACE, MELO and A3E, we use the vector database in Appendix D with the same p = 4 and γ = 0.5. |
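
For readers who want to reproduce the setup, the hyperparameters quoted in the Experiment Setup row can be collected into a single configuration object. The following is a minimal sketch only: the class name `A3EConfig`, the helper `config_for`, and every field name are hypothetical and are not taken from the paper or its repository. Only the numeric values come from the quoted text.

```python
from dataclasses import dataclass

# Unmasked size k per dataset, as quoted in the Experiment Setup row.
UNMASKED_SIZE = {"PEAK-CF": 896, "PEAK-T": 448}


@dataclass(frozen=True)
class A3EConfig:
    """Hypothetical container; field names are illustrative assumptions."""

    dataset: str
    unmasked_size_k: int           # 896 for PEAK-CF, 448 for PEAK-T
    alpha: float = 8.0             # loss weight alpha
    beta: float = 1.0              # loss weight beta
    key_layer: int = 5             # down projection layer whose output is stored
    edited_ffn_layer: int = 31     # FFN layer that is edited
    learning_rate: float = 0.01
    epochs_per_edit: int = 50
    repeat_answer_penalty: float = 10.0  # penalty for already generated answers
    db_p: int = 4                  # vector database parameter p
    db_gamma: float = 0.5          # vector database parameter gamma


def config_for(dataset: str) -> A3EConfig:
    """Build the configuration for one of the two datasets named in the row."""
    if dataset not in UNMASKED_SIZE:
        raise ValueError(f"unknown dataset: {dataset!r}")
    return A3EConfig(dataset=dataset, unmasked_size_k=UNMASKED_SIZE[dataset])
```

Calling `config_for("PEAK-T")` yields the same settings as `config_for("PEAK-CF")` except for the unmasked size, which is the only dataset-specific value the row reports.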