Towards Unified Multimodal Editing with Enhanced Knowledge Collaboration
Authors: Kaihang Pan, Zhaoyu Fan, Juncheng Li, Qifan Yu, Hao Fei, Siliang Tang, Richang Hong, Hanwang Zhang, Qianru Sun
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments validate the effectiveness of our method, which ensures that the post-edit MLLM simultaneously maintains excellent reliability, generality, and locality. |
| Researcher Affiliation | Collaboration | Zhejiang University, National University of Singapore, Hefei University of Technology, Nanyang Technological University, Singapore Management University |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks (clearly labeled algorithm sections or code-like formatted procedures). |
| Open Source Code | Yes | The code for UniKE is available at https://github.com/beepkh/UniKE. |
| Open Datasets | Yes | Our experiments are conducted on the MMEdit benchmark [4], which contains two subtasks: Editing VQA (E-VQA) and Editing Image Caption (E-IC). ... And the MMEdit benchmark is under MIT license. |
| Dataset Splits | No | The paper states that it adheres to the testing settings of the MMEdit dataset and uses the random seed defined in MMEdit during testing, but it does not explicitly provide training/validation/test splits (percentages or sample counts). |
| Hardware Specification | Yes | completing a single one-step edit takes only a matter of seconds and we run all experiments with 6 NVIDIA RTX A6000 GPUs. |
| Software Dependencies | No | The paper mentions using BLIP2-OPT and MiniGPT-4 as backbone models but does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, or CUDA versions). |
| Experiment Setup | Yes | In intrinsic knowledge editing, we add 10 extra key-value pairs in the FFN of the last four transformer layers; for external knowledge resorting, we retrieve the top-40 hidden states of in-context knowledge with the highest similarity for each case and conduct feature shifting for in-context editing in the last four transformer layers. ... During contrastive learning, both encoders are optimized using the Adam optimizer with a learning rate of 1e-4. |
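
The last row packs in several concrete hyperparameters, so a brief sketch may help make them tangible. The PyTorch fragment below is a hypothetical illustration, not the released UniKE code, of the two mechanisms the row describes: extending an FFN's key-value memory with 10 trainable slots for intrinsic knowledge editing, and retrieving the top-40 most similar in-context hidden states for external knowledge resorting. The names (`EditableFFN`, `retrieve_topk`) and the key-value factorization of the FFN are assumptions for illustration; consult the UniKE repository linked above for the actual implementation.

```python
import torch
import torch.nn.functional as F


class EditableFFN(torch.nn.Module):
    """A transformer FFN viewed as a key-value memory, extended with a small
    number of trainable editing slots (the paper adds 10 per edited layer).
    Illustrative sketch only; not the released UniKE code."""

    def __init__(self, ffn_keys: torch.Tensor, ffn_values: torch.Tensor,
                 num_extra: int = 10):
        super().__init__()
        d_model = ffn_keys.shape[0]
        # Frozen pretrained FFN parameters.
        self.register_buffer("keys", ffn_keys)      # (d_model, d_ff)
        self.register_buffer("values", ffn_values)  # (d_ff, d_model)
        # Extra key-value pairs: the only parameters updated during editing.
        self.extra_keys = torch.nn.Parameter(0.02 * torch.randn(d_model, num_extra))
        self.extra_values = torch.nn.Parameter(0.02 * torch.randn(num_extra, d_model))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Concatenate pretrained and editing slots along the memory dimension.
        k = torch.cat([self.keys, self.extra_keys], dim=1)
        v = torch.cat([self.values, self.extra_values], dim=0)
        return F.gelu(h @ k) @ v


def retrieve_topk(query: torch.Tensor, memory: torch.Tensor, k: int = 40) -> torch.Tensor:
    """Retrieve the k in-context hidden states most similar to the query
    (cosine similarity), mirroring the top-40 retrieval in the setup row."""
    sims = F.cosine_similarity(query.unsqueeze(0), memory, dim=-1)  # (N,)
    return memory[sims.topk(k).indices]                             # (k, d_model)
```

Per the setup row, the retrieved states would then drive feature shifting in the last four transformer layers, and the encoders producing the query and memory representations would be trained contrastively with Adam at a learning rate of 1e-4.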