Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Joint Knowledge Editing for Information Enrichment and Probability Promotion

Authors: Wenhang Shi, Yiren Chen, Shuqing Bian, Xinyi Zhang, Zhe Zhao, Pengfei Hu, Wei Lu, Xiaoyong Du

AAAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We rigorously evaluate JEEP by editing up to thousands of facts on various models, i.e., GPT-J (6B) and LLaMA (7B), and addressing diverse editing objectives, i.e., adding factual and counterfactual knowledge. In all tested scenarios, JEEP achieves best performances, validating the effectiveness of the revealings of our probe approach and the designs of our editing method." "We conduct extensive experiments involving edits ranging from 1 to 10,000 across various model architectures, including GPT-J (6B) and LLaMA (7B) (Wang and Komatsuzaki 2021; Touvron et al. 2023), and datasets such as zsRE and Multi-COUNTERFACT (Levy et al. 2017; Meng et al. 2022b). In all tested scenarios, JEEP consistently delivers the optimal performances, confirming the effectiveness of our methodological designs and validating our probe approach to identify critical editing stages."
Researcher Affiliation | Collaboration | Wenhang Shi¹, Yiren Chen², Shuqing Bian³, Xinyi Zhang¹*, Zhe Zhao³*, Pengfei Hu³, Wei Lu¹, Xiaoyong Du¹; ¹Renmin University of China, ²Peking University, ³Tencent; EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes its methods narratively and with mathematical equations (Eq. 1-12) but does not include a clearly labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code | Yes | Code: https://github.com/Eric8932/JEEP
Open Datasets | Yes | "We extract 10,000 real-world factual pairs (x, y) from zsRE (Levy et al. 2017), a question-answering dataset." "We edit 10,000 samples from the Multi-COUNTERFACT dataset (Meng et al. 2022b)."
Dataset Splits | No | The paper uses 10,000 samples each from zsRE and Multi-COUNTERFACT for editing, but it does not explicitly state how these samples are split into training, validation, or test sets for developing or evaluating the editing method itself, beyond the number of samples edited.
Hardware Specification | No | The paper reports experiments on GPT-J (6B) and LLaMA (7B) models but gives no specific details about the hardware (e.g., GPU models, CPU types, memory) used to run them.
Software Dependencies | No | The paper does not explicitly list any software dependencies with specific version numbers (e.g., Python, PyTorch, CUDA) that would be needed to replicate the experiments.
Experiment Setup | No | The paper introduces coefficients for the loss terms (β, β', α) and for adaptive updates (γ, γ') as part of the method description, but concrete values for these hyperparameters, and other typical setup details such as learning rates, batch sizes, or number of epochs, are not given in the main text. It states that 'Implementation details are in Appendix D', suggesting these details are not in the main body of the paper.