Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

ForgerySleuth: Empowering Multimodal Large Language Models for Image Manipulation Detection

Authors: Zhihao Sun, Haoran Jiang, Haoran Chen, Yixin Cao, Xipeng Qiu, Zuxuan Wu, Yu-Gang Jiang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive experiments demonstrate the effectiveness of ForgeryAnalysis and show that ForgerySleuth significantly outperforms existing methods in generalization, robustness, and explainability. Extensive experiments on popular benchmarks demonstrate the success of ForgerySleuth in pixel-level manipulation localization and text-based forgery analysis. Specifically, our approach outperforms the current SoTA method by up to 24.7% in pixel-level localization tasks. Moreover, in the ForgeryAnalysis-Eval comprehensive scoring, our method surpasses the best available model, GPT-4o, achieving an improvement of 35.8%. We conduct an extensive ablation study to analyze the effect of our ForgeryAnalysis dataset and each component and setting within ForgerySleuth framework.
Researcher Affiliation | Academia | Zhihao Sun1,2, Haoran Jiang1,2, Haoran Chen1,2, Yixin Cao1,2, Xipeng Qiu1,2, Zuxuan Wu1,2, Yu-Gang Jiang1,2. 1: Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University. 2: Shanghai Collaborative Innovation Center of Intelligent Visual Computing.
Pseudocode | No | The paper describes the framework and its components using textual descriptions and a diagram (Figure 3), but it does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | https://github.com/sunzhihao18/ForgerySleuth. We have already made the resources publicly available, including the data, code, and weights, to provide resources for advancing the field.
Open Datasets | Yes | To ensure a diverse dataset that covers various types of manipulation, we collect 4,000 tampered images from existing IMD datasets, including MIML [34], CASIA2 [35], DEFACTO [36], and AutoSplice [37]. We utilize six publicly accessible test datasets, which are Columbia [55], Coverage [56], CASIA1 [35], NIST16 [57], IMD20 [58], and COCOGlide [30]. For foundational segmentation abilities, we use semantic segmentation datasets such as ADE20k [51] and COCO-Stuff [52].
Dataset Splits | Yes | To ensure accuracy, we conduct additional cross-validation with more than two experts, selecting 618 samples for evaluation. The remaining data are used for supervised fine-tuning. We refer to this dataset as ForgeryAnalysis-PT. Table 7 presents the data statistics of the ForgeryAnalysis dataset. ForgeryAnalysis-Eval and ForgeryAnalysis-SFT are initially generated by GPT-4o and fully revised by experts. They are used for evaluating the quality of manipulation analysis generated by the M-LLMs and for the final supervised fine-tuning, respectively. ForgeryAnalysis-PT is automatically constructed by our proposed data engine, ForgeryAnalyst, maintaining consistency in data format with the other subsets. Table 7: ForgeryAnalysis-Eval, 618 samples; ForgeryAnalysis-SFT, 1,752 samples; ForgeryAnalysis-PT, 50k samples.
Hardware Specification | Yes | For training, we utilize 2 NVIDIA 80GB A800 GPUs, with training scripts optimized by DeepSpeed [62], which helps reduce memory usage and accelerate training.
Software Dependencies | No | We employ LLaVA-7B-v1-1 [14] as the base multimodal LLM (Fm) and use the ViT-H SAM [42] backbone for the vision encoder (Fv). For training, we utilize 2 NVIDIA 80GB A800 GPUs, with training scripts optimized by DeepSpeed [62], which helps reduce memory usage and accelerate training. We use the AdamW [63] optimizer, setting the learning rate to 0.0002 with no weight decay.
Experiment Setup | Yes | We use the AdamW [63] optimizer, setting the learning rate to 0.0002 with no weight decay. The learning rate is scheduled using WarmupDecayLR, with 100 warmup iterations. The weights for the text generation loss λtxt and mask loss λmask are both set to 1.0, while the BCE loss λbce and DICE loss λdice are weighted at 1.0 and 0.2, respectively. The batch size per device is 4, with gradient accumulation steps set to 4.
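
As a rough illustration of how the quoted Experiment Setup and Hardware Specification values fit together, the following Python sketch collects them in one place and derives two quantities from them. It is not the authors' code. The names `TrainConfig`, `effective_batch_size`, and `total_loss` are hypothetical. Two things are assumed rather than stated in the quoted text: that training is data-parallel across both GPUs, and that the BCE and DICE terms are combined inside the mask loss before the mask loss is weighted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters quoted in the table above (hypothetical container)."""
    learning_rate: float = 2e-4      # AdamW learning rate
    weight_decay: float = 0.0        # "no weight decay"
    warmup_iters: int = 100          # warmup iterations of the LR schedule
    lambda_txt: float = 1.0          # weight of the text generation loss
    lambda_mask: float = 1.0         # weight of the mask loss
    lambda_bce: float = 1.0          # weight of the BCE term
    lambda_dice: float = 0.2         # weight of the DICE term
    batch_size_per_device: int = 4
    grad_accum_steps: int = 4
    num_gpus: int = 2                # 2 NVIDIA A800 80GB GPUs


def effective_batch_size(cfg: TrainConfig) -> int:
    """Samples per optimizer step, ASSUMING data parallelism over all GPUs."""
    return cfg.batch_size_per_device * cfg.grad_accum_steps * cfg.num_gpus


def total_loss(cfg: TrainConfig, l_txt: float, l_bce: float, l_dice: float) -> float:
    """Weighted training loss.

    ASSUMPTION: the mask loss is the weighted sum of the BCE and DICE terms,
    and the total is the weighted sum of the text and mask losses.
    """
    l_mask = cfg.lambda_bce * l_bce + cfg.lambda_dice * l_dice
    return cfg.lambda_txt * l_txt + cfg.lambda_mask * l_mask


if __name__ == "__main__":
    cfg = TrainConfig()
    print("effective batch size:", effective_batch_size(cfg))
    print("example total loss:", total_loss(cfg, l_txt=1.0, l_bce=0.5, l_dice=0.5))
```

Under these assumptions, each optimizer step would process 4 × 4 × 2 = 32 samples. The total number of training steps is not given in the quoted text, so the learning-rate schedule is left out of the sketch.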