CLIP in Mirror: Disentangling text from visual images through reflection
Authors: Tiancheng Wang, Yuguang Yang, Linlin Yang, Shaohui Lin, Juan Zhang, Guodong Guo, Baochang Zhang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments validate our proposed method. For text-visual disentanglement, the class activation maps (CAMs) [24] show that the disentangled textual and visual features correspond precisely to the regions of text and visual objects, respectively. Using the stable diffusion model [21; 20], visual features generate images similar to the original but without text, while textual features generate textual images (i.e., images that contain only text), demonstrating the effectiveness of our method. To quantitatively evaluate the effectiveness of visual feature disentanglement, we compared against the state-of-the-art typographic defense method Defense Prefix [1] on 10 synthetic and 3 real-world typographic attack datasets using disentangled features. |
| Researcher Affiliation | Academia | Tiancheng Wang¹, Yuguang Yang², Linlin Yang⁴, Shaohui Lin⁵, Juan Zhang¹,³, Guodong Guo⁶, Baochang Zhang¹,³ — ¹Institute of Artificial Intelligence, Beihang University, Beijing, China; ²School of Electronic Information Engineering, Beihang University, Beijing, China; ³Zhongguancun Laboratory, Beijing, China; ⁴State Key Laboratory of Media Convergence and Communication, Communication University of China, Beijing, China; ⁵School of Computer Science and Technology, East China Normal University, Shanghai, China; ⁶Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo, China |
| Pseudocode | No | The paper provides mathematical equations and describes procedures, but it does not include explicitly labeled pseudocode blocks or algorithms. |
| Open Source Code | Yes | Our code is available at https://github.com/tcwangbuaa/MirrorCLIP. |
| Open Datasets | Yes | Clean public classification datasets contain rich visual elements from the real world, which can be used to evaluate the robustness and performance of Mirror CLIP. These include ImageNet [4], Caltech101 [6], Oxford Pets [18], Stanford Cars [13], Flowers102 [17], Food101 [2], FGVCAircraft [15], DTD [3], SUN397 [23], and EuroSAT [10]. |
| Dataset Splits | No | The paper uses the term 'validation' in the context of validating its method's effectiveness through experiments, but it does not specify a validation dataset split used for hyperparameter tuning or model selection. |
| Hardware Specification | Yes | All experiments were conducted on an NVIDIA A800 GPU. |
| Software Dependencies | Yes | During the experiments, we used the ViT-B/32 version of CLIP as a pre-trained model and all parameters of CLIP were frozen... The model we employed for image generation in Section 5.2 is Stable unCLIP [20], a new stable diffusion model fine-tuned at 768 × 768 resolution, based on SD2.1-768 [21]. |
| Experiment Setup | Yes | During the experiments, we used the ViT-B/32 version of CLIP as a pre-trained model and all parameters of CLIP were frozen. We informed the CLIP model of our recognition intent by adjusting the text prompt. For text recognition, we used the template 'text of {}' across all datasets. For image recognition, we used the template 'a photo of {}' across all real-world typographic attack datasets, which is shown in Figure 5, and the templates we use across synthetic typographic attack datasets are shown in Table A. |
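
The experiment-setup row quotes a frozen CLIP ViT-B/32 model scored with two prompt templates: 'a photo of {}' for image recognition and 'text of {}' for text recognition. The sketch below is a minimal, hedged illustration of that zero-shot scoring step only, using the OpenAI `clip` package; the class names, image path, and loop are placeholders, and this is not the authors' MirrorCLIP disentanglement pipeline (which additionally operates on mirror-flipped inputs).

```python
# Minimal sketch of the quoted zero-shot setup: frozen CLIP ViT-B/32 scored with
# the "a photo of {}" (image recognition) and "text of {}" (text recognition)
# prompt templates. Class names and the image path are illustrative placeholders.
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model.eval()  # all CLIP parameters stay frozen; no fine-tuning

class_names = ["dog", "cat", "car"]  # placeholder class names
templates = {
    "visual": [f"a photo of {c}" for c in class_names],
    "textual": [f"text of {c}" for c in class_names],
}

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)

    for intent, prompts in templates.items():
        text_tokens = clip.tokenize(prompts).to(device)
        text_feat = model.encode_text(text_tokens)
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
        # cosine-similarity logits, scaled as in standard CLIP zero-shot evaluation
        probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1).cpu()[0]
        print(intent, {c: round(p.item(), 3) for c, p in zip(class_names, probs)})
```

Swapping the template set changes only the text prompts fed to the frozen text encoder, which matches how the paper reports signalling "recognition intent" without touching CLIP's weights.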