Towards Unified Multimodal Editing with Enhanced Knowledge Collaboration

Authors: Kaihang Pan, Zhaoyu Fan, Juncheng Li, Qifan Yu, Hao Fei, Siliang Tang, Richang Hong, Hanwang Zhang, Qianru Sun

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments validate the effectiveness of our method, which ensures that the post-edit MLLM simultaneously maintains excellent reliability, generality, and locality.
Researcher Affiliation | Collaboration | Zhejiang University, National University of Singapore, Hefei University of Technology, Nanyang Technological University, Singapore Management University
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks (clearly labeled algorithm sections or code-like formatted procedures).
Open Source Code | Yes | The code for UniKE is available at https://github.com/beepkh/UniKE.
Open Datasets | Yes | Our experiments are conducted on the MMEdit benchmark [4], which contains two subtasks: Editing VQA (E-VQA) and Editing Image Caption (E-IC). ... And the MMEdit benchmark is under MIT license.
Dataset Splits | No | The paper states that it adheres to the testing settings of the MMEdit dataset and uses the consistent random seed defined in MMEdit during testing, but it does not explicitly provide the training/validation/test dataset splits (percentages or sample counts).
Hardware Specification | Yes | completing a single one-step edit takes only a matter of seconds and we run all experiments with 6 NVIDIA RTX A6000 GPUs.
Software Dependencies | No | The paper mentions using BLIP2-OPT and MiniGPT-4 as backbone models but does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, or CUDA versions).
Experiment Setup | Yes | In intrinsic knowledge editing, we add extra 10 key-value pairs in the FFN of the last four transformer layers; for external knowledge resorting, we retrieve top-40 hidden states of in-context knowledge with the highest similarity for each case and conduct feature shifting for in-context editing in the last four transformer layers. ... During contrastive learning, both encoders are optimized using the Adam optimizer with a learning rate of 1e-4.
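
The reported setup lends itself to a compact illustration. The sketch below is a minimal, hypothetical rendering of the external-knowledge step: the names (EDIT_CONFIG, retrieve_in_context_states, feature_shift) and the shift strength alpha are our assumptions, not UniKE's actual API. It only mirrors the hyperparameters quoted above (10 extra key-value pairs, the last four layers, top-40 retrieval, Adam with learning rate 1e-4).

```python
import torch
import torch.nn.functional as F

# Hypothetical hyperparameters mirroring the reported setup (names are ours, not UniKE's).
EDIT_CONFIG = {
    "extra_kv_pairs": 10,               # key-value pairs added to the FFN of each edited layer
    "edited_layers": [-4, -3, -2, -1],  # last four transformer layers
    "top_k_in_context": 40,             # in-context hidden states retrieved per case
    "lr": 1e-4,                         # Adam learning rate for the contrastive encoders
}

def retrieve_in_context_states(query: torch.Tensor,
                               memory: torch.Tensor,
                               k: int = EDIT_CONFIG["top_k_in_context"]) -> torch.Tensor:
    """Return the top-k stored hidden states most similar to the query.

    query:  (d,)   hidden state of the current case
    memory: (n, d) bank of in-context knowledge hidden states
    """
    sims = F.cosine_similarity(memory, query.unsqueeze(0), dim=-1)  # (n,)
    topk = sims.topk(min(k, memory.size(0))).indices
    return memory[topk]                                             # (k, d)

def feature_shift(hidden: torch.Tensor,
                  retrieved: torch.Tensor,
                  alpha: float = 0.1) -> torch.Tensor:
    """Shift a layer's hidden state toward the mean of the retrieved knowledge states.

    alpha is an assumed interpolation strength; the paper does not report this value.
    """
    return hidden + alpha * (retrieved.mean(dim=0) - hidden)

# Example usage on random tensors (dimensions are illustrative only).
query = torch.randn(768)
memory_bank = torch.randn(1000, 768)
shifted = feature_shift(query, retrieve_in_context_states(query, memory_bank))
```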