Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

IDseq: Decoupled and Sequentially Detecting and Grounding Multi-Modal Media Manipulation

Authors: Runxin Liu, Tian Xie, Jiaming Li, Lingyun Yu, Hongtao Xie

AAAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show the superiority of our IDseq, where it notably outperforms SOTA methods on the fine-grained classification by 3.8% in mAP and the forgery face grounding by 8.7% in IoU_mean, even 1.3% in F1 on the most challenging manipulated text grounding. ... We conduct experiments on the DGM4 dataset (Shao, Wu, and Liu 2023), which comprises 230,000 image-text paired samples... Evaluation Metric. We report our results following the original evaluation protocols and metrics (Shao, Wu, and Liu 2023).
Researcher Affiliation | Academia | 1 University of Science and Technology of China, Hefei, China; 2 Anhui University, Hefei, China; EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes the methodology using text, mathematical formulations, and diagrams (Figures 3, 4, and 5) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain an explicit statement about releasing source code, nor does it provide a link to a code repository.
Open Datasets | Yes | We conduct experiments on the DGM4 dataset (Shao, Wu, and Liu 2023), which comprises 230,000 image-text paired samples, including over 77,000 pristine pairs and 152,000 manipulated pairs.
Dataset Splits | No | We train our IDseq on the training set and evaluate its performance on the test set. ... The input images are resized into 224 × 224, and the text sequence is padded with a max length of 50 for both training and testing. The paper mentions training and test sets but does not provide specific percentages or sample counts for these splits.
Hardware Specification | Yes | The model is trained on four Nvidia A40 GPUs with batch size 128 for 50 epochs.
Software Dependencies | No | We implement our model on PyTorch (Paszke et al. 2019). The paper mentions PyTorch as the framework but does not specify a version number or list other key software components with their versions.
Experiment Setup | Yes | The initial learning rates for encoders and the others are set to 1e-5 and 1e-4 under a cosine schedule. The model is trained on four Nvidia A40 GPUs with batch size 128 for 50 epochs. The input images are resized into 224 × 224, and the text sequence is padded with a max length of 50 for both training and testing. ... where λ1 = 1, λ2 = 1, λ3 = 0.1 and λ4 = 1, λ5 = 0.1, λ6 = 0.1, following the hyperparameter settings of the baseline (Shao, Wu, and Liu 2023).
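The quoted setup can be collected into a small configuration sketch for anyone attempting a reproduction. This is an illustrative reconstruction, not the authors' code: only the numeric values come from the paper; the loss-component structure, the `pad_id`, and the per-epoch cosine schedule without warmup are assumptions.

```python
import math

# Values quoted from the paper; everything else is assumed for illustration.
IMAGE_SIZE = (224, 224)   # input images resized to 224 x 224
MAX_TEXT_LEN = 50         # text sequences padded to a max length of 50
BATCH_SIZE = 128
EPOCHS = 50
ENCODER_LR = 1e-5         # initial LR for the encoders
OTHER_LR = 1e-4           # initial LR for the remaining modules

# Loss weights lambda_1..lambda_6 from the paper (component names unspecified).
LOSS_WEIGHTS = [1.0, 1.0, 0.1, 1.0, 0.1, 0.1]


def total_loss(components):
    """Weighted sum of the six component losses."""
    assert len(components) == len(LOSS_WEIGHTS)
    return sum(w * l for w, l in zip(LOSS_WEIGHTS, components))


def cosine_lr(epoch, base_lr, total_epochs=EPOCHS):
    """Cosine schedule decaying base_lr toward 0 over total_epochs.

    Assumed per-epoch and without warmup; the paper only states
    'under a cosine schedule'.
    """
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))


def pad_tokens(token_ids, max_len=MAX_TEXT_LEN, pad_id=0):
    """Truncate or right-pad a token-id list to max_len (pad_id assumed)."""
    clipped = token_ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))
```

In a real reproduction these constants would feed a PyTorch optimizer with two parameter groups (encoders at 1e-5, everything else at 1e-4) and a `CosineAnnealingLR` scheduler; the pure-Python form above just makes the reported hyperparameters explicit and checkable.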