MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions
Authors: Kai Zhang, Yi Luan, Hexiang Hu, Kenton Lee, Siyuan Qiao, Wenhu Chen, Yu Su, Ming-Wei Chang
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | MagicLens achieves results comparable with or better than prior best on eight benchmarks of various image retrieval tasks, while maintaining high parameter efficiency with a significantly smaller model size. Additional human analyses on a 1.4M-image unseen corpus further demonstrate the diversity of search intents supported by MagicLens. Table 1: Performance comparison on five benchmarks of three multimodality-to-image retrieval tasks. |
| Researcher Affiliation | Collaboration | Kai Zhang*¹, Yi Luan², Hexiang Hu², Kenton Lee², Siyuan Qiao², Wenhu Chen², Yu Su¹, Ming-Wei Chang². *Work done at Google DeepMind. ¹The Ohio State University, ²Google DeepMind. Correspondence to: Kai Zhang <zhang.13253@osu.edu>. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. The methods are described in textual paragraphs and in figures illustrating the data flow. |
| Open Source Code | Yes | Code and models are publicly available at the Project Website. |
| Open Datasets | Yes | Trained on 36.7M (query image, instruction, target image) triplets with rich semantic relations mined from the web. We evaluate MagicLens backbone encoders on Flickr30k (Plummer et al., 2015) and MSCOCO (Chen et al., 2015). DTIN is constructed from natural images in ImageNet (Deng et al., 2009) and images in other domains in ImageNet-R (Hendrycks et al., 2021). |
| Dataset Splits | Yes | Following previous work (Saito et al., 2023; Baldrati et al., 2023; Gu et al., 2024), we evaluate on its validation set and report recall averaged over sub-tasks. The best checkpoints are selected based on performance on the validation sets of CIRR and CIRCO. |
| Hardware Specification | Yes | We train our base and large models on 64 and 128 TPUs, respectively. |
| Software Dependencies | No | The paper mentions specific software components and models such as Adafactor, PaLI, PaLM 2, CLIP, and CoCa. While it cites Adafactor (Shazeer & Stern, 2018), it does not give version numbers for these software dependencies, nor the programming language and its version. |
| Experiment Setup | Yes | We set an image resolution of 288×288 and a patch size of 18×18. For CLIP-based MagicLens, we set an image resolution of 224×224 and use ViT-B/16 and ViT-L/14. The number of newly added self-attention layers is 4, and τ is learnable and initialized to 0.07. We set the batch size to 2048 and trained our models for a maximum of 50,000 steps with Adafactor (Shazeer & Stern, 2018) and an early stopping mechanism. The learning rates are set differently for newly introduced parameters and reused CLIP or CoCa parameters, at 2e-5 and 2e-6, respectively. (See the configuration sketch after this table.) |
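
The following is a minimal sketch that collects the hyperparameters quoted in the Experiment Setup row into a single Python dataclass. The class and field names are illustrative assumptions, not identifiers from the authors' released code; only the values come from the paper.

```python
from dataclasses import dataclass

@dataclass
class MagicLensTrainConfig:
    """Hypothetical container for the training hyperparameters reported in the paper."""
    # CoCa-based models: 288x288 images with 18x18 patches.
    # CLIP-based models: 224x224 images with ViT-B/16 or ViT-L/14 backbones.
    image_resolution: int = 288
    patch_size: int = 18
    num_new_self_attention_layers: int = 4   # newly added self-attention layers
    temperature_init: float = 0.07           # learnable temperature tau
    batch_size: int = 2048
    max_steps: int = 50_000                  # trained with early stopping
    optimizer: str = "adafactor"             # Shazeer & Stern, 2018
    lr_new_params: float = 2e-5              # newly introduced parameters
    lr_backbone_params: float = 2e-6         # reused CLIP / CoCa parameters
```

The two learning rates reflect the setup described above: the pretrained CLIP or CoCa backbone weights are updated an order of magnitude more conservatively than the newly introduced layers.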