CLIP in Mirror: Disentangling text from visual images through reflection

Authors: Tiancheng Wang, Yuguang Yang, Linlin Yang, Shaohui Lin, Juan Zhang, Guodong Guo, Baochang Zhang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments validate our proposed method. For text-visual disentanglement, the class activation maps (CAMs) [24] show that the disentangled textual and visual features correspond precisely to the regions of text and visual objects, respectively. Using the stable diffusion model [21; 20], visual features generate images similar to the original but without text, while textual features generate textual images (i.e., images that contain only text), demonstrating the effectiveness of our method. To quantitatively evaluate the effectiveness of visual feature disentanglement, we compared against the state-of-the-art typographic defense method Defense Prefix [1] on 10 synthetic and 3 real-world typographic attack datasets using disentangled features.
Researcher Affiliation | Academia | Tiancheng Wang (1), Yuguang Yang (2), Linlin Yang (4), Shaohui Lin (5), Juan Zhang (1,3), Guodong Guo (6), Baochang Zhang (1,3). Affiliations: (1) Institute of Artificial Intelligence, Beihang University, Beijing, China; (2) School of Electronic Information Engineering, Beihang University, Beijing, China; (3) Zhongguancun Laboratory, Beijing, China; (4) State Key Laboratory of Media Convergence and Communication, Communication University of China, Beijing, China; (5) School of Computer Science and Technology, East China Normal University, Shanghai, China; (6) Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo, China
Pseudocode | No | The paper provides mathematical equations and describes procedures, but it does not include explicitly labeled pseudocode blocks or algorithms.
Open Source Code | Yes | Our code is available at https://github.com/tcwangbuaa/MirrorCLIP.
Open Datasets | Yes | Clean public classification datasets contain rich visual elements from the real world, which can be used to evaluate the robustness and performance of MirrorCLIP. These include ImageNet [4], Caltech101 [6], Oxford Pets [18], Stanford Cars [13], Flowers102 [17], Food101 [2], FGVC-Aircraft [15], DTD [3], SUN397 [23], and EuroSAT [10]. (See the dataset-loading sketch after the table.)
Dataset Splits | No | The paper uses the term 'validation' in the context of validating their method's effectiveness through experiments, but it does not specify a validation dataset split used for hyperparameter tuning or model selection.
Hardware Specification | Yes | All experiments were conducted on an NVIDIA A800 GPU.
Software Dependencies | Yes | During the experiments, we used the ViT-B/32 version of CLIP as a pre-trained model, and all parameters of CLIP were frozen... The model we employed for image generation in Section 5.2 is Stable unCLIP [20], a new stable diffusion model fine-tuned at 768 × 768 resolution, based on SD2.1-768 [21]. (See the dependency-loading sketch after the table.)
Experiment Setup | Yes | During the experiments, we used the ViT-B/32 version of CLIP as a pre-trained model, and all parameters of CLIP were frozen. We informed the CLIP model of our recognition intent by adjusting the text prompt. For text recognition, we used the template 'text of {}' across all datasets. For image recognition, we used the template 'a photo of {}' across all real-world typographic attack datasets, as shown in Figure 5; the templates used across the synthetic typographic attack datasets are shown in Table A. (See the prompt-template sketch after the table.)
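
Dataset-loading sketch. The clean datasets listed under Open Datasets are available through standard wrappers; below is a minimal, hedged sketch (not the authors' code) of instantiating a few of them with torchvision while reusing CLIP's own preprocessing transform. The "data" root, split choices, download flags, and the use of torchvision itself are assumptions.

    # Minimal sketch: loading some of the listed evaluation datasets with torchvision.
    import clip
    from torchvision import datasets

    _, preprocess = clip.load("ViT-B/32")  # CLIP's own image preprocessing

    eval_sets = {
        "Caltech101":  datasets.Caltech101("data", transform=preprocess, download=True),
        "Oxford Pets": datasets.OxfordIIITPet("data", split="test", transform=preprocess, download=True),
        "Flowers102":  datasets.Flowers102("data", split="test", transform=preprocess, download=True),
        "EuroSAT":     datasets.EuroSAT("data", transform=preprocess, download=True),
        # ImageNet, Stanford Cars, Food101, FGVC-Aircraft, DTD, and SUN397 follow the same pattern.
    }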
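
Dependency-loading sketch. For the Software Dependencies row, a minimal sketch of assembling the stated components: a frozen CLIP ViT-B/32 backbone (via the openai CLIP package) and a Stable unCLIP pipeline from Hugging Face diffusers. The package choices and the checkpoint ID are assumptions, not taken from the paper.

    # Minimal sketch: frozen CLIP ViT-B/32 plus a Stable unCLIP pipeline for image generation.
    import torch
    import clip  # pip install git+https://github.com/openai/CLIP.git
    from diffusers import StableUnCLIPImg2ImgPipeline

    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Pre-trained CLIP ViT-B/32 with every parameter frozen, as stated in the paper.
    model, preprocess = clip.load("ViT-B/32", device=device)
    for p in model.parameters():
        p.requires_grad = False
    model.eval()

    # Stable unCLIP (fine-tuned from SD2.1-768 at 768x768), loaded here from an assumed
    # Hugging Face checkpoint; it decodes images from CLIP image embeddings.
    pipe = StableUnCLIPImg2ImgPipeline.from_pretrained(
        "stabilityai/stable-diffusion-2-1-unclip", torch_dtype=torch.float16
    ).to(device)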
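
Prompt-template sketch. For the Experiment Setup row, a sketch of how the two reported templates could be turned into zero-shot classifier weights with the frozen CLIP model; the class names below are illustrative placeholders, not the datasets' actual label sets.

    # Minimal sketch: building zero-shot text features from the paper's two templates.
    import torch
    import clip

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, _ = clip.load("ViT-B/32", device=device)

    def zero_shot_weights(classnames, template):
        """Encode one prompt per class and return L2-normalized text features."""
        prompts = [template.format(c) for c in classnames]
        with torch.no_grad():
            feats = model.encode_text(clip.tokenize(prompts).to(device))
        return feats / feats.norm(dim=-1, keepdim=True)

    # Templates reported in the paper; class lists here are placeholders.
    text_weights  = zero_shot_weights(["stop", "dog", "pizza"], "text of {}")          # text recognition
    image_weights = zero_shot_weights(["stop sign", "dog", "pizza"], "a photo of {}")  # image recognition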