The CLIP Model is Secretly an Image-to-Prompt Converter

Authors: Yuxuan Ding, Chunna Tian, Haoxuan Ding, Lingqiao Liu

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In contrast, this paper demonstrates that the CLIP model, as utilized in Stable Diffusion, inherently possesses the ability to instantaneously convert images into text prompts. Such an image-to-prompt conversion can be achieved by utilizing a linear projection matrix that is calculated in a closed form. Moreover, the paper showcases that this capability can be further enhanced by either utilizing a small amount of similar-domain training data (approximately 100 images) or incorporating several online training steps (around 30 iterations) on the reference images. By leveraging these approaches, the proposed method offers a simple and flexible solution to bridge the gap between images and text prompts. This methodology can be applied to various tasks such as image variation and image editing, facilitating more effective and seamless interaction between images and textual prompts. (A hedged sketch of this closed-form projection appears after the table.)
Researcher Affiliation | Academia | Yuxuan Ding, School of Electronic Engineering, Xidian University, Xi'an 710071, China (yxding@stu.xidian.edu.cn); Chunna Tian, School of Electronic Engineering, Xidian University, Xi'an 710071, China (chnatian@xidian.edu.cn); Haoxuan Ding, Unmanned System Research Institute, Northwestern Polytechnical University, Xi'an 710072, China (haoxuan.ding@mail.nwpu.edu.cn); Lingqiao Liu, Australian Institute for Machine Learning, The University of Adelaide, Adelaide 5005, Australia (lingqiao.liu@adelaide.edu.au)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks clearly labeled as such.
Open Source Code | No | The paper does not provide an explicit statement about releasing its own source code or a link to a code repository for the methodology described.
Open Datasets | Yes | We use images randomly sampled from the ImageNet [26], CelebA-HQ [27], and Places365 [28] datasets to encourage the model to extract object, identity, and scene information, respectively. We evaluate image variation on MSCOCO [24] using all 5,000 images in the 2017-split validation set.
Dataset Splits | Yes | We evaluate image variation on MSCOCO [24] using all 5,000 images in the 2017-split validation set. Each dataset includes 100 images, and the test images do not overlap with the training classes. (A hedged split-construction sketch appears after the table.)
Hardware Specification | Yes | The process took 200,000 GPU-hours on NVIDIA A100-40GB GPUs, while our approach only requires 1 GPU-hour on an NVIDIA A5000-24GB GPU. SD-IPC-CT only takes 30 iterations of updates, around 1 minute on 2 A5000 GPUs, while Custom Diffusion [13] needs 250 iterations (6 minutes on 2 A100 GPUs).
Software Dependencies | No | The paper mentions using "Stable Diffusion v1.4" and "CLIP ViT-L/14" models but does not provide specific version numbers for software libraries or frameworks like PyTorch, TensorFlow, or CUDA.
Experiment Setup | Yes | SD-IPC-FT is trained for 100, 50, and 100 epochs on ImageNet [26], CelebA-HQ [27], and Places365 [28], respectively. The learning rate for all datasets is 1e-5 with cosine decay. Customized generation uses a constant learning rate of 5e-6 for 30-iteration updates. Training is conducted on 2 A5000 GPUs. The editing α is set to 0.9. (A hedged training-configuration sketch appears after the table.)
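
The Research Type row above quotes the paper's central claim: a closed-form linear projection turns a CLIP image embedding into a prompt-like conditioning vector for Stable Diffusion. The Python sketch below illustrates one plausible reading of that claim using the Hugging Face transformers CLIP classes; the pseudo-inverse construction, the model name, the input path, and all variable names are assumptions of this sketch, not the authors' released implementation.

    # Hypothetical sketch (not the authors' released code): approximate the CLIP
    # text encoder's EOS hidden state directly from an image by pseudo-inverting
    # CLIP's text projection. Stable Diffusion conditions on text-encoder hidden
    # states, so the projected vector can act as a rough "prompt" for the image.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

    # W_t maps the text EOS hidden state into CLIP's shared embedding space.
    W_t = clip.text_projection.weight                     # (proj_dim, text_hidden)

    image = Image.open("reference.jpg")                   # hypothetical input path
    pixels = processor(images=image, return_tensors="pt").pixel_values

    with torch.no_grad():
        # Pooled image embedding in the shared CLIP space (visual projection applied).
        v = clip.get_image_features(pixel_values=pixels)  # (1, proj_dim)

        # Closed-form image-to-prompt step: CLIP training aligns projected text and
        # image embeddings, so pseudo-inverting W_t sends the image embedding back
        # into the text encoder's hidden-state space.
        eos_hidden_approx = v @ torch.linalg.pinv(W_t).T  # (1, text_hidden)

    # eos_hidden_approx would then replace (or be blended with) the corresponding
    # token feature that Stable Diffusion's cross-attention reads from the text
    # encoder; the exact slot, scaling, and normalization follow the paper.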
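
The Dataset Splits row states that each fine-tuning dataset contains 100 images and that test images come from classes disjoint from the training ones. The snippet below is a hypothetical illustration of how such a class-disjoint split could be built from a class-per-folder image directory; the number of training classes, the file pattern, and the random seed are assumptions, since the paper excerpt does not specify them.

    # Hypothetical split construction: about 100 training images from one domain
    # dataset arranged as one folder per class, with test images drawn only from
    # classes not used for training. Class counts, file pattern, and seed are
    # illustrative assumptions.
    import random
    from pathlib import Path

    def make_class_disjoint_split(root, n_train_images=100, n_train_classes=20, seed=0):
        rng = random.Random(seed)
        classes = sorted(p.name for p in Path(root).iterdir() if p.is_dir())
        rng.shuffle(classes)
        train_classes = classes[:n_train_classes]
        test_classes = classes[n_train_classes:]        # disjoint from training classes

        train_pool = [p for c in train_classes for p in (Path(root) / c).glob("*.jpg")]
        train_images = rng.sample(train_pool, k=min(n_train_images, len(train_pool)))
        test_images = [p for c in test_classes for p in (Path(root) / c).glob("*.jpg")]
        return train_images, test_images

    # Example (hypothetical path): train_imgs, test_imgs = make_class_disjoint_split("/data/places365")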
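
The Experiment Setup row lists the training hyperparameters: epochs per dataset, a learning rate of 1e-5 with cosine decay for fine-tuning, a constant 5e-6 for the 30-iteration customized update, and an editing α of 0.9. The sketch below wires those numbers into a PyTorch optimizer and scheduler configuration; the optimizer choice (AdamW), the steps-per-epoch value, the placeholder module, and the omitted diffusion loss are assumptions, not details taken from the paper.

    # Hypothetical configuration that plugs in the quoted hyperparameters. The
    # optimizer (AdamW), the steps-per-epoch value, and the placeholder module are
    # assumptions; the diffusion training objective itself is omitted.
    import torch
    from torch.optim import AdamW
    from torch.optim.lr_scheduler import CosineAnnealingLR

    model = torch.nn.Linear(768, 768)        # stands in for the trainable parameters

    # SD-IPC-FT fine-tuning: lr 1e-5 with cosine decay; 100 epochs on ImageNet and
    # Places365, 50 epochs on CelebA-HQ.
    epochs, steps_per_epoch = 100, 50        # steps_per_epoch is an assumption
    ft_optimizer = AdamW(model.parameters(), lr=1e-5)
    ft_scheduler = CosineAnnealingLR(ft_optimizer, T_max=epochs * steps_per_epoch)

    # SD-IPC-CT customization: 30 update iterations at a constant lr of 5e-6.
    ct_optimizer = AdamW(model.parameters(), lr=5e-6)
    for _ in range(30):
        # loss = diffusion_loss(model, reference_batch)   # objective omitted here
        # loss.backward(); ct_optimizer.step(); ct_optimizer.zero_grad()
        pass

    # Editing strength alpha = 0.9, per the quoted setup; how it is applied to the
    # projected image prompt follows the paper, not this sketch.
    alpha = 0.9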