Exposing Text-Image Inconsistency Using Diffusion Models

Authors: Mingzhen Huang, Shan Jia, Zhou Zhou, Yan Ju, Jialing Cai, Siwei Lyu

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To evaluate D-TIIL's efficacy, we introduce a new TIIL dataset containing 14K consistent and inconsistent text-image pairs. The main contributions of our work can be summarized as follows: we develop a new method, D-TIIL, that leverages text-to-image diffusion models to expose text-image inconsistency, localizing both the inconsistent image regions and words; we introduce a new dataset, TIIL, built on real-world image-text pairs from the Visual News dataset, for evaluating text-image inconsistency localization with pixel-level and word-level inconsistency annotations. This section presents a comprehensive analysis of our approach, including qualitative and quantitative results, comparisons with other methods, and ablation studies evaluating different variations. Table 3: Comparison of text-image inconsistency localization. Table 4: Comparison of detection.
Researcher Affiliation | Academia | Mingzhen Huang, Shan Jia, Zhou Zhou, Yan Ju, Jialing Cai, Siwei Lyu (University at Buffalo, State University of New York)
Pseudocode | No | The paper describes the D-TIIL process in detail through text and diagrams (Figs. 2 and 3) but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | Please refer to the Project Page for source code and dataset. We will only release our code as open source on the condition that it must not be used to distribute harmful, offensive, or dehumanizing content, or otherwise harmful representations of people or their environments, cultures, religions, etc., produced with the model weights.
Open Datasets | Yes | To evaluate D-TIIL's efficacy, we introduce a new TIIL dataset containing 14K consistent and inconsistent text-image pairs. We introduce a new dataset, TIIL, built on real-world image-text pairs from the Visual News dataset (Liu et al., 2020), for evaluating text-image inconsistency localization with pixel-level and word-level inconsistency annotations. Please refer to the Project Page for source code and dataset.
Dataset Splits | No | The TIIL dataset consists of approximately 14K image-text pairs, comprising 7,138 inconsistent and 7,101 consistent pairs. The paper describes the overall size and composition of the dataset but does not explicitly provide the training, validation, and test splits used in the experiments.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments.
Software Dependencies | No | The paper mentions using Stable Diffusion and CLIP models but does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions) needed for replication.
Experiment Setup | Yes | We train both text embeddings E_aln and E_dnt for 500 iterations with a learning rate of 4e-6. The hyperparameter γ is set to 8 in our experiments. For noise estimation, we use the same random seed for the two conditioned text embeddings, remove outlier values from the noise predictions, and average the spatial differences over a set of 10 input noises. After obtaining the predicted inconsistency masks, we binarize them using a threshold equal to the average value of each mask, and retain only the top 3 mask regions with the largest areas. To localize inconsistent words, we follow previous work (Radford et al., 2021a) and use the prompt template "A photo of {words}" for CLIP (Radford et al., 2021a) text embedding generation.
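
The noise-estimation step quoted above can be illustrated with a short sketch. This is not the authors' released code: the Stable Diffusion checkpoint, the timestep `t`, the latent scaling factor, and the percentile-based outlier removal are all assumptions filled in for illustration.

```python
# Sketch of the noise-difference localization step. Hypothetical choices:
# the checkpoint, timestep t, and the 95th-percentile outlier clipping.
# Requires: pip install torch diffusers transformers
import torch
from diffusers import StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5"  # checkpoint is an assumption
).to(device)

@torch.no_grad()
def noise_difference_map(image, emb_aln, emb_dnt, n_noises=10, t=500):
    """image: (1, 3, H, W) tensor in [-1, 1]; emb_aln / emb_dnt: the two
    learned text embeddings E_aln and E_dnt, shape (1, seq_len, dim)."""
    # Encode into the latent space (0.18215 is SD's standard scaling factor).
    latents = pipe.vae.encode(image.to(device)).latent_dist.sample() * 0.18215
    timestep = torch.tensor([t], device=device)
    diffs = []
    for seed in range(n_noises):
        # Same seed -> identical input noise for both conditioned predictions.
        gen = torch.Generator(device).manual_seed(seed)
        noise = torch.randn(latents.shape, generator=gen,
                            device=device, dtype=latents.dtype)
        noisy = pipe.scheduler.add_noise(latents, noise, timestep)
        eps_aln = pipe.unet(noisy, timestep, encoder_hidden_states=emb_aln).sample
        eps_dnt = pipe.unet(noisy, timestep, encoder_hidden_states=emb_dnt).sample
        d = (eps_aln - eps_dnt).abs().mean(dim=1)      # average over channels
        d = torch.minimum(d, torch.quantile(d, 0.95))  # crude outlier removal
        diffs.append(d)
    # Spatial difference map, averaged over the n_noises input noises.
    return torch.stack(diffs).mean(dim=0)
```

Upsampling this latent-resolution map to image size gives the raw inconsistency mask that is then post-processed.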
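The binarization and region-filtering rule (threshold at the mask's average value, keep the top 3 largest regions) is straightforward to reproduce. A minimal sketch with `scipy.ndimage`; the default 4-connectivity of the component labeling is an assumption, as the paper does not specify the implementation:

```python
# Binarize a predicted mask at its own mean, then keep only the three
# connected regions with the largest areas, as described in the setup.
import numpy as np
from scipy import ndimage

def postprocess_mask(mask: np.ndarray, top_k: int = 3) -> np.ndarray:
    binary = mask > mask.mean()          # threshold = average mask value
    labels, n = ndimage.label(binary)    # default 4-connectivity (assumption)
    if n <= top_k:
        return binary
    # Area of each labeled region (labels run from 1 to n).
    areas = ndimage.sum_labels(binary, labels, index=np.arange(1, n + 1))
    keep = np.argsort(areas)[-top_k:] + 1
    return np.isin(labels, keep)
```

For example, `postprocess_mask(diff_map.squeeze(0).cpu().numpy())` would convert the averaged difference map from the previous sketch into a final binary localization mask.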
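Finally, the word-level step applies the prompt template from Radford et al. (2021a). Below is a sketch using the Hugging Face `transformers` CLIP; the checkpoint and the per-word scoring shown here are illustrative assumptions, since the paper only specifies the template itself:

```python
# Score candidate words against an image with CLIP, using the prompt
# template "A photo of {words}". The scoring rule is an illustrative
# assumption, not the paper's exact word-localization procedure.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")  # assumption
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def word_image_scores(image: Image.Image, words: list[str]) -> dict[str, float]:
    prompts = [f"A photo of {w}" for w in words]
    inputs = processor(text=prompts, images=image,
                       return_tensors="pt", padding=True)
    out = model(**inputs)
    # logits_per_image: (1, num_prompts) image-text similarity logits.
    return dict(zip(words, out.logits_per_image.squeeze(0).tolist()))
```

Words whose prompts align poorly with the image are natural candidates for the inconsistent-word set.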