MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions
Authors: Kai Zhang, Yi Luan, Hexiang Hu, Kenton Lee, Siyuan Qiao, Wenhu Chen, Yu Su, Ming-Wei Chang
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | MagicLens achieves results comparable with or better than prior best on eight benchmarks of various image retrieval tasks, while maintaining high parameter efficiency with a significantly smaller model size. Additional human analyses on a 1.4M-image unseen corpus further demonstrate the diversity of search intents supported by MagicLens. Table 1: Performance comparison on five benchmarks of three multimodality-to-image retrieval tasks. |
| Researcher Affiliation | Collaboration | Kai Zhang*¹, Yi Luan², Hexiang Hu², Kenton Lee², Siyuan Qiao², Wenhu Chen², Yu Su¹, Ming-Wei Chang². *Work done at Google DeepMind. ¹The Ohio State University, ²Google DeepMind. Correspondence to: Kai Zhang <zhang.13253@osu.edu>. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. The methods are described in textual paragraphs and in figures illustrating the data flow. |
| Open Source Code | Yes | Code and models are publicly available at the Project Website. |
| Open Datasets | Yes | Trained on 36.7M (query image, instruction, target image) triplets with rich semantic relations mined from the web. We evaluate MagicLens backbone encoders on Flickr30k (Plummer et al., 2015) and MSCOCO (Chen et al., 2015). DTIN is constructed from natural images in ImageNet (Deng et al., 2009) and images in other domains in ImageNet-R (Hendrycks et al., 2021). |
| Dataset Splits | Yes | Following previous work (Saito et al., 2023; Baldrati et al., 2023; Gu et al., 2024), we evaluate on its validation set and report recall averaged over sub-tasks. The best checkpoints are selected based on performance on the validation sets of CIRR and CIRCO. |
| Hardware Specification | Yes | We train our base and large models on 64 and 128 TPUs, respectively. |
| Software Dependencies | No | The paper mentions specific software components and models such as Adafactor, PaLI, PaLM 2, CLIP, and CoCa. While it cites Adafactor (Shazeer & Stern, 2018), it does not give version numbers for these software dependencies, nor the programming language and its version. |
| Experiment Setup | Yes | We set an image resolution of 288×288 and a patch size of 18×18. For CLIP-based MagicLens, we set an image resolution of 224×224 and use ViT-B/16 and ViT-L/14. The number of newly added self-attention layers is 4, and τ is learnable and initialized to 0.07. We set the batch size to 2048 and trained our models for a maximum of 50,000 steps with Adafactor (Shazeer & Stern, 2018) and an early stopping mechanism. The learning rates are set differently for newly introduced parameters and reused CLIP or CoCa parameters, at 2e-5 and 2e-6, respectively. (See the configuration sketch after this table.) |
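
The following is a minimal sketch that collects the hyperparameters quoted in the Experiment Setup row into a single Python dataclass. The class and field names are illustrative assumptions, not identifiers from the authors' released code; only the values come from the paper.

```python
from dataclasses import dataclass

@dataclass
class MagicLensTrainConfig:
    """Hypothetical container for the training hyperparameters reported in the paper."""
    # CoCa-based models: 288x288 images with 18x18 patches.
    # CLIP-based models: 224x224 images with ViT-B/16 or ViT-L/14 backbones.
    image_resolution: int = 288
    patch_size: int = 18
    num_new_self_attention_layers: int = 4   # newly added self-attention layers
    temperature_init: float = 0.07           # learnable temperature tau
    batch_size: int = 2048
    max_steps: int = 50_000                  # trained with early stopping
    optimizer: str = "adafactor"             # Shazeer & Stern, 2018
    lr_new_params: float = 2e-5              # newly introduced parameters
    lr_backbone_params: float = 2e-6         # reused CLIP / CoCa parameters
```

The two learning rates reflect the setup described above: the pretrained CLIP or CoCa backbone weights are updated an order of magnitude more conservatively than the newly introduced layers.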